kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Convert an existing project into a project using Kedro as a Library + Framework #2924

Closed yetudada closed 1 year ago

yetudada commented 1 year ago

Description

This follows the insights from #2901; one of the issues flagged is how to make it easier to use Kedro with existing projects. This task focuses on putting yourself in the shoes of our users as they refactor their work to adopt Kedro.

Possible Implementation

The scope of this task involves taking this project and converting it into a project that uses Kedro as a:

You can work with the post authors, Daisy Wood and Mitchell West. Mitchell additionally added: "Happy to work with the Kedro team for guidance getting our approach up and running if it would be helpful? Next steps for this approach is to hopefully create a standalone python package, so would be great to see what a Kedro implementation would look like."

Outcomes

Share your learnings on what steps you took and what the process was like. It would be great to see a process diagram and possible workflow improvements that could be made.

noklam commented 1 year ago

Resource

https://github.com/kedro-org/kedro/issues/2924#issuecomment-1688124435

My refactor exercise repository (see the original notebook and refactor notebook), branches: refactor-with-library and refactor-kedro-pipeline (only partial, covering the pre-processing pipeline)

Migration Process

  1. Pure Notebook from scratch
  2. Notebook Using yaml + DataCatalog
  3. Notebook using OmegaConfigLoader + DataCatalog
  4. Move the Notebook to a Kedro Pipeline (refactor nodes, pipelines)
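Step 2 above moves the notebook's hard-coded paths into a `catalog.yml`. A minimal fragment might look like this (dataset name and path are illustrative, not taken from the refactor repo):

```yaml
# conf/base/catalog.yml
example_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/example.csv
```

In the notebook, this config is then passed to `DataCatalog.from_config` and the data loaded with `catalog.load("example_data")`.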

Methodology

I started with a notebook and put it in folder/nested_folder, an unusual location chosen to break as many assumptions as I could. During refactoring, I tried to make one change at a time and re-run the program to ensure the logic wasn't changed, but sometimes that isn't possible.

Issues I encountered

  1. Templating with parameters.yml quickly hits a plateau and requires OmegaConfigLoader for templating advanced types. I use OmegaConf.register_new_resolver("T", dynamic_injection, replace=True) to handle types like torch.bfloat16 or other arbitrary types
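A resolver like the one mentioned above could be sketched as follows; the name `dynamic_injection` comes from the comment, but this implementation is an assumption, not the actual code from the exercise repo:

```python
import importlib

def dynamic_injection(dotted_path: str):
    """Resolve a dotted import path such as 'torch.bfloat16' to the
    actual Python object, so it can be referenced from YAML config.

    Hypothetical sketch of the resolver described above, not Kedro API.
    """
    module_path, _, attr = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, attr)

# Registered with OmegaConf so that "${T:torch.bfloat16}" in YAML
# resolves to the real dtype object:
#   OmegaConf.register_new_resolver("T", dynamic_injection, replace=True)
```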

  2. Using OmegaConfigLoader in a notebook is not easy. You need OmegaConfigLoader(".", base_env="", default_run_env="")

  3. DataCatalog does not take conf_source like ConfigLoader does; instead, a function _convert_paths_to_absolute_posix is applied in KedroContext.

  4. AutoModelForCausalLM.from_pretrained takes both positional and keyword arguments. How do I specify that in a node?

  5. tokenizer is a global singleton used implicitly in functions (nodes), i.e. it is used inside a function but never passed as an argument. In a notebook this is fine because everything runs sequentially, but it breaks when you run it as a Kedro node. Kedro needs pure Python functions, yet stateful objects are being passed around. For example, LabelEncoder needs to be fit before transform, so changing the execution order changes the result
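The usual refactor here is to make the state an explicit node input/output. A sketch of the pattern, using a hand-rolled stand-in for sklearn's LabelEncoder so the example is self-contained (function names are illustrative, not from the refactor repo):

```python
def fit_encoder(labels):
    """Node 1: build the label -> index mapping (the 'fit' state)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}

def encode_labels(labels, mapping):
    """Node 2: pure transform -- the state arrives as an explicit input."""
    return [mapping[label] for label in labels]

# With explicit inputs/outputs, Kedro can order the nodes from the DAG
# instead of relying on sequential notebook execution.
```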

  6. %reload_kedro is slow - but I need to run it a lot when I refactor code

  7. %load_ext autoreload and %autoreload 2 are helpful - but they don't work every time; %reload_kedro is still sometimes needed to refresh nodes.py code (not sure why yet)

  8. Hot reload of parameters.yml and catalog.yml comes in handy during refactoring - are we sure we want to deprecate it? See: Removing KedroContext params and catalog hot-reload

  9. Undocumented feature: you can change src with source_dir in pyproject.toml
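For issue 9, the setting presumably lives under the `[tool.kedro]` table; a hypothetical fragment (key name taken from the comment above, value illustrative):

```toml
[tool.kedro]
# assumed syntax: relocate the source directory from the default "src"
source_dir = "lib"
```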

noklam commented 1 year ago

Themes that will not be discussed in this TD

Python Function vs Nodes

Having parameters.yml is the first step, but it doesn't end there. You may want to organise your parameter groups, but sometimes you find that Kedro forces you to organise them in a particular (unintended) way. See issue 5 above: Kedro doesn't support the full Python function signature.

For example

```python
def my_function(*model_args, **kwargs):
    return "something"
```
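One workaround (an assumption on my part, not an official Kedro pattern) is to wrap the variadic function in an adapter with an explicit signature, so the node can receive its arguments as ordinary catalog or parameter entries:

```python
def my_function(*model_args, **kwargs):
    # variadic function that a Kedro node cannot call directly
    return {"args": model_args, "kwargs": kwargs}

def my_function_node(model_args: list, options: dict):
    """Fixed-signature adapter suitable for use as a Kedro node:
    the variadic arguments arrive packed as a list and a dict."""
    return my_function(*model_args, **options)
```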

Hot-reload

Issues 6, 7 and 8 are related to the larger milestone: Improve the usability and debugging experience for Jupyter notebooks

noklam commented 1 year ago

2023-08-23

Discussion for DataCatalog path conversion issue

```python
conf_catalog = config_loader["catalog"]  # config
conf_catalog = _convert_paths_to_absolute_posix(Path("../../").resolve(), conf_catalog)  # config/catalog
catalog = DataCatalog.from_config(conf_catalog)  # catalog
catalog.load("example_data")
```
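To make the discussion concrete, here is a simplified sketch of what that private `_convert_paths_to_absolute_posix` step does, under my own assumptions about its behaviour (this is not the actual Kedro implementation):

```python
from pathlib import Path

def make_filepaths_absolute(project_root, conf_catalog):
    """Rewrite each dataset's relative 'filepath' against the project
    root, producing absolute POSIX paths. Hypothetical stand-in for
    KedroContext's private helper."""
    root = Path(project_root).resolve()
    converted = {}
    for name, config in conf_catalog.items():
        config = dict(config)  # shallow copy; leave the input untouched
        filepath = config.get("filepath")
        if filepath is not None and not Path(filepath).is_absolute():
            config["filepath"] = (root / filepath).as_posix()
        converted[name] = config
    return converted
```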

Discussed Solution

  1. DataCatalog.from_config(data_source=)
     1.1 DataCatalog.from_config(c, path_parameters=("filepath", "file")) - an extra parameter to define the path parameters.
  2. config_loader["catalog"] + an optional argument root to change the default root. i.e. OmegaConfigLoader(root="some_path") following Raised by @astrojuanlu
  3. Remove the logic from DataCatalog and add a custom resolver: OmegaConfigLoader(custom_resolvers={"path": lambda p: abspath(...)}) # but it is not backwards compatible

     ```yaml
     test_ds:
       type: pandas.CSVDataSet
       filepath: "${path:data/01_raw/thing.csv}"
     ```
  4. somehow the config loader transforms the keys in a backwards compatible way?
  5. Substitution for the root of the project:

     ```yaml
     test_ds:
       type: pandas.CSVDataSet
       filepath: "${root}/data/01_raw/thing.csv"
     ```

Notes:

noklam commented 1 year ago

Closing as the exercise is finished and we identified a few key issues. Relevant comments are spread across these issues: https://github.com/kedro-org/kedro/issues/1460, https://github.com/kedro-org/kedro/issues/2819#issuecomment-1669771871