kedro-org / kedro-starters

Templates for your Kedro projects.
Apache License 2.0

Update starters to support pandas 2.0 #127

Closed astrojuanlu closed 1 year ago

astrojuanlu commented 1 year ago

Description

Currently the starters can break if a user has pandas 2.0 installed. Update all starters so they can run fine with pandas 2.0 as well as older versions. This means updating the pin for kedro-datasets to ~=1.0 instead of ~=1.0.0.
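To illustrate why the two pins differ: under the compatible-release operator, `~=1.0` accepts any 1.x release, while `~=1.0.0` only accepts 1.0.x patch releases. The sketch below is a simplified stand-in for that rule, handling plain dotted numeric versions only. The helper names `_parse` and `is_compatible_release` are hypothetical, and the version numbers are illustrative rather than a list of actual kedro-datasets releases.

```python
from typing import Tuple


def _parse(version: str) -> Tuple[int, ...]:
    """Split a plain dotted version such as '1.3.0' into integers."""
    return tuple(int(part) for part in version.split("."))


def is_compatible_release(candidate: str, pin: str) -> bool:
    """Simplified check for `candidate ~= pin` (numeric versions only).

    The candidate must be at least the pinned version and must share every
    pinned component except the last one.
    """
    cand, base = _parse(candidate), _parse(pin)
    width = max(len(cand), len(base))
    # Pad with zeros so that '1.0' and '1.0.0' compare as equal versions.
    cand_padded = cand + (0,) * (width - len(cand))
    base_padded = base + (0,) * (width - len(base))
    prefix = base[:-1]
    return cand_padded >= base_padded and cand_padded[: len(prefix)] == prefix


# Illustrative version numbers, not a statement about real releases.
for version in ["1.0.2", "1.3.0", "2.0.0"]:
    print(
        version,
        "| ~=1.0:", is_compatible_release(version, "1.0"),
        "| ~=1.0.0:", is_compatible_release(version, "1.0.0"),
    )
```

With the looser `~=1.0` pin, a resolver is free to pick a newer 1.x release of kedro-datasets, which the `~=1.0.0` pin would rule out.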

Context

For example in spaceflights:

This should not be a problem if the user follows the normal workflow, but if they install pandas 2 separately, things break:

> pip install kedro pandas scikit-learn openpyxl pyarrow  # problems incoming
> kedro new --starter=spaceflights
> cd spaceflights
> kedro run  # uh oh
[05/05/23 15:25:57] INFO     Kedro project spaceflights                                                                                                                                                 session.py:360
[05/05/23 15:25:59] WARNING  /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/importlib/__init__.py:126: DeprecationWarning: `kedro.extras.datasets` is deprecated and will be removed in     warnings.py:109
                             Kedro 0.19, install `kedro-datasets` instead by running `pip install kedro-datasets`.                                                                                                    
                               return _bootstrap._gcd_import(name[level:], package, level)                                                                                                                            

[05/05/23 15:26:00] INFO     Loading data from 'companies' (CSVDataSet)...                                                                                                                         data_catalog.py:343
                    INFO     Running node: preprocess_companies_node: preprocess_companies([companies]) -> [preprocessed_companies]                                                                        node.py:329
                    INFO     Saving data to 'preprocessed_companies' (ParquetDataSet)...                                                                                                           data_catalog.py:382
                    INFO     Completed 1 out of 6 tasks                                                                                                                                        sequential_runner.py:85
                    INFO     Loading data from 'shuttles' (ExcelDataSet)...                                                                                                                        data_catalog.py:343
[05/05/23 15:26:04] INFO     Running node: preprocess_shuttles_node: preprocess_shuttles([shuttles]) -> [preprocessed_shuttles]                                                                            node.py:329
                    ERROR    Node 'preprocess_shuttles_node: preprocess_shuttles([shuttles]) -> [preprocessed_shuttles]' failed with error:                                                                node.py:354
                             could not convert string to float: '$1325.0'                                                                                                                                             
                    WARNING  There are 5 nodes that have not run.                                                                                                                                        runner.py:205
                             You can resume the pipeline run from the nearest nodes with persisted inputs by adding the following argument to your previous command:                                                  
                               --from-nodes "preprocess_shuttles_node,create_model_input_table_node"                                                                                                                  
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/juan_cano/.micromamba/envs/_test310/bin/kedro:8 in <module>                               │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/framework/cli/cli. │
│ py:211 in main                                                                                   │
│                                                                                                  │
│   208 │   """                                                                                    │
│   209 │   _init_plugins()                                                                        │
│   210 │   cli_collection = KedroCLI(project_path=Path.cwd())                                     │
│ ❱ 211 │   cli_collection()                                                                       │
│   212                                                                                            │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:1130 in    │
│ __call__                                                                                         │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/framework/cli/cli. │
│ py:139 in main                                                                                   │
│                                                                                                  │
│   136 │   │   )                                                                                  │
│   137 │   │                                                                                      │
│   138 │   │   try:                                                                               │
│ ❱ 139 │   │   │   super().main(                                                                  │
│   140 │   │   │   │   args=args,                                                                 │
│   141 │   │   │   │   prog_name=prog_name,                                                       │
│   142 │   │   │   │   complete_var=complete_var,                                                 │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:1055 in    │
│ main                                                                                             │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:1657 in    │
│ invoke                                                                                           │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:1404 in    │
│ invoke                                                                                           │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/click/core.py:760 in     │
│ invoke                                                                                           │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/framework/cli/proj │
│ ect.py:472 in run                                                                                │
│                                                                                                  │
│   469 │   with KedroSession.create(                                                              │
│   470 │   │   env=env, conf_source=conf_source, extra_params=params                              │
│   471 │   ) as session:                                                                          │
│ ❱ 472 │   │   session.run(                                                                       │
│   473 │   │   │   tags=tag,                                                                      │
│   474 │   │   │   runner=runner(is_async=is_async),                                              │
│   475 │   │   │   node_names=node_names,                                                         │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/framework/session/ │
│ session.py:426 in run                                                                            │
│                                                                                                  │
│   423 │   │   )                                                                                  │
│   424 │   │                                                                                      │
│   425 │   │   try:                                                                               │
│ ❱ 426 │   │   │   run_result = runner.run(                                                       │
│   427 │   │   │   │   filtered_pipeline, catalog, hook_manager, session_id                       │
│   428 │   │   │   )                                                                              │
│   429 │   │   │   self._run_called = True                                                        │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:9 │
│ 1 in run                                                                                         │
│                                                                                                  │
│    88 │   │   │   self._logger.info(                                                             │
│    89 │   │   │   │   "Asynchronous mode is enabled for loading and saving data"                 │
│    90 │   │   │   )                                                                              │
│ ❱  91 │   │   self._run(pipeline, catalog, hook_manager, session_id)                             │
│    92 │   │                                                                                      │
│    93 │   │   self._logger.info("Pipeline execution completed successfully.")                    │
│    94                                                                                            │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/sequential_ │
│ runner.py:70 in _run                                                                             │
│                                                                                                  │
│   67 │   │                                                                                       │
│   68 │   │   for exec_index, node in enumerate(nodes):                                           │
│   69 │   │   │   try:                                                                            │
│ ❱ 70 │   │   │   │   run_node(node, catalog, hook_manager, self._is_async, session_id)           │
│   71 │   │   │   │   done_nodes.add(node)                                                        │
│   72 │   │   │   except Exception:                                                               │
│   73 │   │   │   │   self._suggest_resume_scenario(pipeline, done_nodes, catalog)                │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:3 │
│ 19 in run_node                                                                                   │
│                                                                                                  │
│   316 │   if is_async:                                                                           │
│   317 │   │   node = _run_node_async(node, catalog, hook_manager, session_id)                    │
│   318 │   else:                                                                                  │
│ ❱ 319 │   │   node = _run_node_sequential(node, catalog, hook_manager, session_id)               │
│   320 │                                                                                          │
│   321 │   for name in node.confirms:                                                             │
│   322 │   │   catalog.confirm(name)                                                              │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:4 │
│ 15 in _run_node_sequential                                                                       │
│                                                                                                  │
│   412 │   )                                                                                      │
│   413 │   inputs.update(additional_inputs)                                                       │
│   414 │                                                                                          │
│ ❱ 415 │   outputs = _call_node_run(                                                              │
│   416 │   │   node, catalog, inputs, is_async, hook_manager, session_id=session_id               │
│   417 │   )                                                                                      │
│   418                                                                                            │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:3 │
│ 81 in _call_node_run                                                                             │
│                                                                                                  │
│   378 │   │   │   is_async=is_async,                                                             │
│   379 │   │   │   session_id=session_id,                                                         │
│   380 │   │   )                                                                                  │
│ ❱ 381 │   │   raise exc                                                                          │
│   382 │   hook_manager.hook.after_node_run(                                                      │
│   383 │   │   node=node,                                                                         │
│   384 │   │   catalog=catalog,                                                                   │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/runner/runner.py:3 │
│ 71 in _call_node_run                                                                             │
│                                                                                                  │
│   368 ) -> Dict[str, Any]:                                                                       │
│   369 │   # pylint: disable=too-many-arguments                                                   │
│   370 │   try:                                                                                   │
│ ❱ 371 │   │   outputs = node.run(inputs)                                                         │
│   372 │   except Exception as exc:                                                               │
│   373 │   │   hook_manager.hook.on_node_error(                                                   │
│   374 │   │   │   error=exc,                                                                     │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/pipeline/node.py:3 │
│ 55 in run                                                                                        │
│                                                                                                  │
│   352 │   │   # purposely catch all exceptions                                                   │
│   353 │   │   except Exception as exc:                                                           │
│   354 │   │   │   self._logger.error("Node '%s' failed with error: \n%s", str(self), str(exc))   │
│ ❱ 355 │   │   │   raise exc                                                                      │
│   356 │                                                                                          │
│   357 │   def _run_with_no_inputs(self, inputs: Dict[str, Any]):                                 │
│   358 │   │   if inputs:                                                                         │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/pipeline/node.py:3 │
│ 44 in run                                                                                        │
│                                                                                                  │
│   341 │   │   │   if not self._inputs:                                                           │
│   342 │   │   │   │   outputs = self._run_with_no_inputs(inputs)                                 │
│   343 │   │   │   elif isinstance(self._inputs, str):                                            │
│ ❱ 344 │   │   │   │   outputs = self._run_with_one_input(inputs, self._inputs)                   │
│   345 │   │   │   elif isinstance(self._inputs, list):                                           │
│   346 │   │   │   │   outputs = self._run_with_list(inputs, self._inputs)                        │
│   347 │   │   │   elif isinstance(self._inputs, dict):                                           │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/kedro/pipeline/node.py:3 │
│ 75 in _run_with_one_input                                                                        │
│                                                                                                  │
│   372 │   │   │   │   f"{sorted(inputs.keys())}."                                                │
│   373 │   │   │   )                                                                              │
│   374 │   │                                                                                      │
│ ❱ 375 │   │   return self._func(inputs[node_input])                                              │
│   376 │                                                                                          │
│   377 │   def _run_with_list(self, inputs: Dict[str, Any], node_inputs: List[str]):              │
│   378 │   │   # Node inputs and provided run inputs should completely overlap                    │
│                                                                                                  │
│ /private/tmp/spaceflights/src/spaceflights/pipelines/data_processing/nodes.py:45 in              │
│ preprocess_shuttles                                                                              │
│                                                                                                  │
│   42 │   """                                                                                     │
│   43 │   shuttles["d_check_complete"] = _is_true(shuttles["d_check_complete"])                   │
│   44 │   shuttles["moon_clearance_complete"] = _is_true(shuttles["moon_clearance_complete"])     │
│ ❱ 45 │   shuttles["price"] = _parse_money(shuttles["price"])                                     │
│   46 │   return shuttles                                                                         │
│   47                                                                                             │
│   48                                                                                             │
│                                                                                                  │
│ /private/tmp/spaceflights/src/spaceflights/pipelines/data_processing/nodes.py:16 in _parse_money │
│                                                                                                  │
│   13                                                                                             │
│   14 def _parse_money(x: pd.Series) -> pd.Series:                                                │
│   15 │   x = x.str.replace("$", "", regex=True).str.replace(",", "")                             │
│ ❱ 16 │   x = x.astype(float)                                                                     │
│   17 │   return x                                                                                │
│   18                                                                                             │
│   19                                                                                             │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/generic.py:6 │
│ 324 in astype                                                                                    │
│                                                                                                  │
│    6321 │   │                                                                                    │
│    6322 │   │   else:                                                                            │
│    6323 │   │   │   # else, only a single dtype is given                                         │
│ ❱  6324 │   │   │   new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)           │
│    6325 │   │   │   return self._constructor(new_data).__finalize__(self, method="astype")       │
│    6326 │   │                                                                                    │
│    6327 │   │   # GH 33113: handle empty frame or series                                         │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/internals/ma │
│ nagers.py:451 in astype                                                                          │
│                                                                                                  │
│    448 │   │   elif using_copy_on_write():                                                       │
│    449 │   │   │   copy = False                                                                  │
│    450 │   │                                                                                     │
│ ❱  451 │   │   return self.apply(                                                                │
│    452 │   │   │   "astype",                                                                     │
│    453 │   │   │   dtype=dtype,                                                                  │
│    454 │   │   │   copy=copy,                                                                    │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/internals/ma │
│ nagers.py:352 in apply                                                                           │
│                                                                                                  │
│    349 │   │   │   if callable(f):                                                               │
│    350 │   │   │   │   applied = b.apply(f, **kwargs)                                            │
│    351 │   │   │   else:                                                                         │
│ ❱  352 │   │   │   │   applied = getattr(b, f)(**kwargs)                                         │
│    353 │   │   │   result_blocks = extend_blocks(applied, result_blocks)                         │
│    354 │   │                                                                                     │
│    355 │   │   out = type(self).from_blocks(result_blocks, self.axes)                            │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/internals/bl │
│ ocks.py:511 in astype                                                                            │
│                                                                                                  │
│    508 │   │   """                                                                               │
│    509 │   │   values = self.values                                                              │
│    510 │   │                                                                                     │
│ ❱  511 │   │   new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)           │
│    512 │   │                                                                                     │
│    513 │   │   new_values = maybe_coerce_values(new_values)                                      │
│    514                                                                                           │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/dtypes/astyp │
│ e.py:242 in astype_array_safe                                                                    │
│                                                                                                  │
│   239 │   │   dtype = dtype.numpy_dtype                                                          │
│   240 │                                                                                          │
│   241 │   try:                                                                                   │
│ ❱ 242 │   │   new_values = astype_array(values, dtype, copy=copy)                                │
│   243 │   except (ValueError, TypeError):                                                        │
│   244 │   │   # e.g. _astype_nansafe can fail on object-dtype of strings                         │
│   245 │   │   #  trying to convert to float                                                      │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/dtypes/astyp │
│ e.py:187 in astype_array                                                                         │
│                                                                                                  │
│   184 │   │   values = values.astype(dtype, copy=copy)                                           │
│   185 │                                                                                          │
│   186 │   else:                                                                                  │
│ ❱ 187 │   │   values = _astype_nansafe(values, dtype, copy=copy)                                 │
│   188 │                                                                                          │
│   189 │   # in pandas we don't store numpy str dtypes, so convert to object                      │
│   190 │   if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):                 │
│                                                                                                  │
│ /Users/juan_cano/.micromamba/envs/_test310/lib/python3.10/site-packages/pandas/core/dtypes/astyp │
│ e.py:138 in _astype_nansafe                                                                      │
│                                                                                                  │
│   135 │                                                                                          │
│   136 │   if copy or is_object_dtype(arr.dtype) or is_object_dtype(dtype):                       │
│   137 │   │   # Explicit copy, or required since NumPy can't view from / to object.              │
│ ❱ 138 │   │   return arr.astype(dtype, copy=True)                                                │
│   139 │                                                                                          │
│   140 │   return arr.astype(dtype, copy=copy)                                                    │
│   141                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: could not convert string to float: '$1325.0'

I was about to do a quick demonstration of the spaceflights pipeline, and instead of following the normal process, I installed the dependencies "by hand".

deepyaman commented 1 year ago

@astrojuanlu This issue only affects the spaceflights starter; everything else upgrades fine. ~It may have something to do with an underlying error from numpy, where the block size changes on cast, but I haven't yet figured this out.~ Will keep you posted if I make progress.

Edit: JK, I think this is because of bad code in Spaceflights: x = x.str.replace("$", "", regex=True). If the pattern is treated as a regex, "$" is the end-of-string anchor rather than a literal dollar sign, so nothing gets stripped. That's why you get ValueError: could not convert string to float: '$1325.0' further down.
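A minimal sketch of that mechanism, using the standard-library `re` module on a single string rather than the pandas code path from the traceback. The `parse_money` helper is a hypothetical scalar counterpart of the starter's `_parse_money`, and the sample value is made up for illustration. In pandas, the analogous fix would presumably be to pass `regex=False` (or escape the pattern as `r"\$"`) in `Series.str.replace`.

```python
import re

raw_price = "$1,325.0"  # illustrative value in the same style as the data

# As a regex, "$" is the end-of-string anchor. It matches the empty position
# at the end of the text, so replacing it with "" changes nothing.
as_regex = re.sub("$", "", raw_price)

# Escaping the pattern, or using a plain literal replace, removes the sign.
escaped = re.sub(re.escape("$"), "", raw_price)
literal = raw_price.replace("$", "")


def parse_money(text: str) -> float:
    """Hypothetical scalar version of _parse_money using literal replaces."""
    return float(text.replace("$", "").replace(",", ""))


print(repr(as_regex))   # dollar sign still present
print(repr(escaped))    # dollar sign removed
print(parse_money(raw_price))

try:
    float(as_regex.replace(",", ""))
except ValueError as exc:
    # Same failure mode as in the traceback above.
    print("ValueError:", exc)
```

Using a literal replacement keeps the behaviour the same regardless of how a given pandas version interprets single-character patterns.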