kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Convert an existing project into a project using Kedro as a Library + Framework #2924

Closed yetudada closed 1 year ago

yetudada commented 1 year ago

Description

This follows the insights from #2901; one of the issues flagged is how to make it easier to use Kedro with existing projects. This task focuses on putting yourself in the shoes of our users as they refactor their work to adopt Kedro.

Possible Implementation

The scope of this task involves taking this project and converting it into a project that uses Kedro as a:

You can work with the post authors, Daisy Wood and Mitchell West. Mitchell additionally added: "Happy to work with the Kedro team for guidance getting our approach up and running if it would be helpful? Next steps for this approach is to hopefully create a standalone python package, so would be great to see what a Kedro implementation would look like."

Outcomes

Share your learnings on what steps you took and what the process was like. It would be great to see a process diagram and possible workflow improvements that could be made.

noklam commented 1 year ago

Resource

https://github.com/kedro-org/kedro/issues/2924#issuecomment-1688124435

My refactor exercise repository (see the original notebook and refactor notebook), branches: refactor-with-library and refactor-kedro-pipeline (only partial, covering the pre-processing pipeline)

Migration Process

  1. Pure Notebook from scratch
  2. Notebook Using yaml + DataCatalog
  3. Notebook using OmegaConfigLoader + DataCatalog
  4. Move the Notebook to a Kedro Pipeline (refactor nodes, pipelines)
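Step 2 above moves the notebook's hard-coded paths into a `catalog.yml`. A minimal fragment might look like this (dataset name and path are illustrative, not taken from the refactor repo):

```yaml
# conf/base/catalog.yml
example_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/example.csv
```

In the notebook, this config is then passed to `DataCatalog.from_config` and the data loaded with `catalog.load("example_data")`.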

Methodology

I started with a notebook and put it in folder/nested_folder, an unusual location chosen to break as many assumptions as I could. During refactoring, I tried to make one change at a time and re-run the program to ensure the logic wasn't changed, but sometimes that isn't possible.

Issues I encountered

  1. Templating with parameters.yml quickly hits a plateau and requires OmegaConfigLoader for templating advanced types. I use OmegaConf.register_new_resolver("T", dynamic_injection, replace=True) to handle types like torch.bfloat16 or other arbitrary types
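A resolver like the one mentioned above could be sketched as follows; the name `dynamic_injection` comes from the comment, but this implementation is an assumption, not the actual code from the exercise repo:

```python
import importlib

def dynamic_injection(dotted_path: str):
    """Resolve a dotted import path such as 'torch.bfloat16' to the
    actual Python object, so it can be referenced from YAML config.

    Hypothetical sketch of the resolver described above, not Kedro API.
    """
    module_path, _, attr = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, attr)

# Registered with OmegaConf so that "${T:torch.bfloat16}" in YAML
# resolves to the real dtype object:
#   OmegaConf.register_new_resolver("T", dynamic_injection, replace=True)
```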

  2. Using OmegaConfigLoader in a notebook is not easy. You need OmegaConfigLoader(".", base_env="", default_run_env="")

  3. DataCatalog does not take conf_source like ConfigLoader does; instead, a function _convert_paths_to_absolute_posix is applied in KedroContext.

  4. AutoModelForCausalLM.from_pretrained takes both positional and keyword arguments. How do I specify that in a node?

  5. tokenizer is a global singleton used implicitly in functions (nodes), i.e. it is used inside a function but never passed as an argument. In a notebook this is fine because everything runs sequentially, but it breaks when you run it as a Kedro node. Kedro needs pure Python functions, yet stateful objects are being passed around. For example, LabelEncoder needs to be fit before transform, so changing the execution order changes the result
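The usual refactor here is to make the state an explicit node input/output. A sketch of the pattern, using a hand-rolled stand-in for sklearn's LabelEncoder so the example is self-contained (function names are illustrative, not from the refactor repo):

```python
def fit_encoder(labels):
    """Node 1: build the label -> index mapping (the 'fit' state)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}

def encode_labels(labels, mapping):
    """Node 2: pure transform -- the state arrives as an explicit input."""
    return [mapping[label] for label in labels]

# With explicit inputs/outputs, Kedro can order the nodes from the DAG
# instead of relying on sequential notebook execution.
```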

  6. %reload_kedro is slow - but I need to run it a lot when I refactor code

  7. %load_ext autoreload and %autoreload 2 are helpful - but they don't work every time; %reload_kedro is still sometimes needed to refresh nodes.py code (not sure why yet)

  8. Hot reload of parameters.yml and catalog.yml comes in handy during refactoring - are we sure we want to deprecate it? See: Removing KedroContext params and catalog hot-reload

  9. Undocumented feature: you can change src with source_dir in pyproject.toml
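For issue 9, the setting presumably lives under the `[tool.kedro]` table; a hypothetical fragment (key name taken from the comment above, value illustrative):

```toml
[tool.kedro]
# assumed syntax: relocate the source directory from the default "src"
source_dir = "lib"
```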

noklam commented 1 year ago

Themes that will not be discussed in this TD

Python Function vs Nodes

Having parameters.yml is the first step, but it doesn't end there. You may want to organise your parameter groups, but sometimes you find that Kedro forces you to organise them in a particular (unintended) way. See issue 5 above: Kedro doesn't support the full Python function signature.

For example

```python
def my_function(*model_args, **kwargs):
    return "something"
```
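One workaround (an assumption on my part, not an official Kedro pattern) is to wrap the variadic function in an adapter with an explicit signature, so the node can receive its arguments as ordinary catalog or parameter entries:

```python
def my_function(*model_args, **kwargs):
    # variadic function that a Kedro node cannot call directly
    return {"args": model_args, "kwargs": kwargs}

def my_function_node(model_args: list, options: dict):
    """Fixed-signature adapter suitable for use as a Kedro node:
    the variadic arguments arrive packed as a list and a dict."""
    return my_function(*model_args, **options)
```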

Hot-reload

Issues 6, 7 and 8 are related to the larger milestone: Improve the usability and debugging experience for Jupyter notebooks

noklam commented 1 year ago

2023-08-23

Discussion for DataCatalog path conversion issue

```python
conf_catalog = config_loader["catalog"]  # config
conf_catalog = _convert_paths_to_absolute_posix(Path("../../").resolve(), conf_catalog)  # config/catalog
catalog = DataCatalog.from_config(conf_catalog)  # catalog
catalog.load("example_data")
```
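To make the discussion concrete, here is a simplified sketch of what that private `_convert_paths_to_absolute_posix` step does, under my own assumptions about its behaviour (this is not the actual Kedro implementation):

```python
from pathlib import Path

def make_filepaths_absolute(project_root, conf_catalog):
    """Rewrite each dataset's relative 'filepath' against the project
    root, producing absolute POSIX paths. Hypothetical stand-in for
    KedroContext's private helper."""
    root = Path(project_root).resolve()
    converted = {}
    for name, config in conf_catalog.items():
        config = dict(config)  # shallow copy; leave the input untouched
        filepath = config.get("filepath")
        if filepath is not None and not Path(filepath).is_absolute():
            config["filepath"] = (root / filepath).as_posix()
        converted[name] = config
    return converted
```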

Discussed Solution

  1. DataCatalog.from_config(data_source=)
     1.1 DataCatalog.from_config(c, path_parameters=("filepath", "file")) - an extra parameter to define the path parameters.
  2. config_loader["catalog"] + an optional argument root to change the default root. i.e. OmegaConfigLoader(root="some_path") following Raised by @astrojuanlu
  3. Remove the logic from DataCatalog and add a custom resolver: OmegaConfigLoader(custom_resolvers={"path": lambda p: abspath(...)}) # but it is not backwards compatible

     ```yaml
     test_ds:
       type: pandas.CSVDataSet
       filepath: "${path:data/01_raw/thing.csv}"
     ```
  4. somehow the config loader transforms the keys in a backwards compatible way?
  5. Substitution for the root of the project:

     ```yaml
     test_ds:
       type: pandas.CSVDataSet
       filepath: "${root}/data/01_raw/thing.csv"
     ```

Notes:

noklam commented 1 year ago

Closing as the exercise is finished and we identified a few key issues. Relevant comments are spread across these issues: https://github.com/kedro-org/kedro/issues/1460, https://github.com/kedro-org/kedro/issues/2819#issuecomment-1669771871