Closed yetudada closed 1 year ago
https://github.com/kedro-org/kedro/issues/2924#issuecomment-1688124435
My refactor exercise repository (see the original notebook and refactor notebook), branch: refactor-with-library and refactor-kedro-pipeline (only partially, for the pre-processing pipeline)
yaml + DataCatalog
OmegaConfigLoader + DataCatalog
I started with a notebook and put it in folder/nested_folder, an unusual location, to try to break as many assumptions as I can. During refactoring, I try to make one change at a time and re-run the program to ensure the logic hasn't changed, but sometimes that is not possible.
Templating with parameters.yml quickly hits a plateau and requires OmegaConfigLoader for templating advanced types. I use OmegaConf.register_new_resolver("T", dynamic_injection, replace=True) to handle types like torch.bfloat16 or any arbitrary type.
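A minimal sketch of what a resolver like this could look like. The body of dynamic_injection below is an assumption (the original function is not shown in this thread): it simply imports the module named in a dotted path and returns the attribute. The registration is guarded so the snippet still runs where omegaconf is not installed.

```python
import importlib
from typing import Any


def dynamic_injection(dotted_path: str) -> Any:
    """Resolve a dotted path such as "torch.bfloat16" to the object it names.

    Hypothetical implementation: the original dynamic_injection is not shown.
    """
    module_name, _, attr_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.attribute', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


try:
    from omegaconf import OmegaConf
except ImportError:  # omegaconf is optional for this sketch
    OmegaConf = None

if OmegaConf is not None:
    # replace=True lets the cell be re-run in a notebook without raising
    # an "already registered" error.
    OmegaConf.register_new_resolver("T", dynamic_injection, replace=True)

# Works on any importable module, shown here with the standard library.
pi_value = dynamic_injection("math.pi")
```

With the resolver registered, a parameters.yml entry could then be written as dtype: "${T:torch.bfloat16}".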
Using OmegaConfigLoader in a notebook is not easy. You need OmegaConfigLoader(".", base_env="", default_run_env="").
DataCatalog does not take conf_source like ConfigLoader; a function _convert_paths_to_absolute_posix is used in KedroContext instead.
AutoModelForCausalLM.from_pretrained takes positional and keyword arguments. How do I specify them in a node?
tokenizer is a global singleton that gets used in functions (nodes) implicitly, which means it is used in a function but not passed as an argument. When you run it in a notebook it's fine because everything runs sequentially, but that is not true when you run it as a Kedro node. Kedro needs "pure Python functions", but there are stateful classes being passed around. For example, LabelEncoder needs to be fit before transform, so changing the order will change the behaviour of the code.
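One possible way to make the fit-before-transform ordering explicit is to pass the fitted object between functions as a normal input and output instead of relying on a global. The sketch below uses a hypothetical stand-in class (TinyLabelEncoder, not scikit-learn's LabelEncoder) so that it runs without extra dependencies; the function names are illustrative as well.

```python
from typing import Dict, List


class TinyLabelEncoder:
    """Hypothetical stand-in for a stateful encoder such as LabelEncoder."""

    def __init__(self) -> None:
        self.mapping: Dict[str, int] = {}

    def fit(self, labels: List[str]) -> "TinyLabelEncoder":
        # Assign integer codes in sorted label order.
        self.mapping = {label: i for i, label in enumerate(sorted(set(labels)))}
        return self

    def transform(self, labels: List[str]) -> List[int]:
        return [self.mapping[label] for label in labels]


def fit_encoder(train_labels: List[str]) -> TinyLabelEncoder:
    """Node-style function: the fitted encoder is an explicit output."""
    return TinyLabelEncoder().fit(train_labels)


def encode_labels(encoder: TinyLabelEncoder, labels: List[str]) -> List[int]:
    """Node-style function: the encoder is an explicit input, not a global."""
    return encoder.transform(labels)


encoder = fit_encoder(["cat", "dog", "cat", "bird"])
encoded = encode_labels(encoder, ["dog", "bird"])
```

Because encode_labels takes the encoder as an input, a pipeline that wires fit_encoder's output into it has to run the fit step first, regardless of the order in which the nodes are declared.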
%reload_kedro is slow, but I need to do this a lot when I refactor code.
%load_ext autoreload and %autoreload 2 are helpful, but they do not work every time; %reload_kedro is still sometimes needed to refresh nodes.py code (not sure yet).
Hot reload of parameters.yml and catalog.yml comes in handy during refactoring. Are we sure we want to deprecate it? See: Removing KedroContext params and catalog hot-reload
Undocumented feature: you can change src with source_dir in pyproject.toml.
Having parameters.yml is the first step, but it doesn't end there. You may want to organise your parameters into groups, but sometimes you find that Kedro forces you to organise them in a way you did not intend, i.e. 5.
Kedro doesn't support the full Python function signature. For example:
def my_function(*model_kwargs, **kwargs):
    return "something"
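A possible workaround, sketched under assumptions: wrap the function so that it takes a single dictionary holding the positional and keyword arguments, which could then be fed from one parameters group. The names call_with_params and load_model and the "args"/"kwargs" keys are hypothetical and not part of Kedro.

```python
from typing import Any, Callable, Dict, Sequence


def call_with_params(func: Callable[..., Any]) -> Callable[[Dict[str, Any]], Any]:
    """Wrap func so it takes one params dict with "args" and "kwargs" keys."""

    def wrapper(params: Dict[str, Any]) -> Any:
        args: Sequence[Any] = params.get("args", [])
        kwargs: Dict[str, Any] = params.get("kwargs", {})
        return func(*args, **kwargs)

    return wrapper


def load_model(name: str, *model_args: Any, **kwargs: Any) -> str:
    """Hypothetical stand-in for a loader with a (*args, **kwargs) signature."""
    return f"{name}|{model_args}|{sorted(kwargs.items())}"


# The dict below mirrors what a group in parameters.yml could look like.
params = {"args": ["my-model"], "kwargs": {"torch_dtype": "bfloat16"}}
result = call_with_params(load_model)(params)
```

The wrapped callable has exactly one named input, so it could be registered as a node whose input is a single parameters group, at the cost of hiding the real signature from the pipeline definition.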
6., 7. and 8. are related to the larger milestone Improve the usability and debugging experience for Jupyter notebooks
2023-08-23
Discussion for DataCatalog path conversion issue
DataCatalog.from_config and add a new argument data_source (similar to conf_source)?
ConfigLoader, because DataCatalog expects a dict and the responsibility to read config falls to ConfigLoader? It's blurry because it's not exactly "reading" config; it changes the content of catalog.yml.
DataCatalog._convert_path or DataCatalog.convert_path
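from pathlib import Path

from kedro.framework.context.context import _convert_paths_to_absolute_posix
from kedro.io import DataCatalog

conf_catalog = config_loader["catalog"]  # config
conf_catalog = _convert_paths_to_absolute_posix(Path("../../").resolve(), conf_catalog)  # config/catalog
catalog = DataCatalog.from_config(conf_catalog)  # catalog
catalog.load("example_data")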
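To illustrate what the path conversion step in the snippet above does conceptually, here is a standard-library sketch. It is not Kedro's implementation: make_paths_absolute, its file_keys argument, and the rule for skipping URLs are all assumptions made for this example.

```python
from pathlib import Path
from typing import Any, Dict, Tuple


def _is_relative_local(value: str) -> bool:
    """True for plain relative paths; URLs such as s3://... are left alone."""
    return "://" not in value and not Path(value).is_absolute()


def make_paths_absolute(
    root: Path,
    conf: Dict[str, Any],
    file_keys: Tuple[str, ...] = ("filepath", "path"),
) -> Dict[str, Any]:
    """Return a copy of conf with relative file paths anchored at root."""
    converted: Dict[str, Any] = {}
    for key, value in conf.items():
        if isinstance(value, dict):
            # Recurse into nested dataset definitions.
            converted[key] = make_paths_absolute(root, value, file_keys)
        elif key in file_keys and isinstance(value, str) and _is_relative_local(value):
            converted[key] = (root / value).as_posix()
        else:
            converted[key] = value
    return converted


catalog_conf = {
    "example_data": {"type": "pandas.CSVDataSet", "filepath": "data/01_raw/thing.csv"},
    "remote_data": {"type": "pandas.CSVDataSet", "filepath": "s3://bucket/thing.csv"},
}
absolute_conf = make_paths_absolute(Path("/project"), catalog_conf)
```

The open question in the discussion is where a step like this should live and how the root is supplied, not how the conversion itself works.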
Discussed Solution
path
parameters.root to change the default root, i.e. OmegaConfigLoader(root="some_path")
following (raised by @astrojuanlu):
"""
test_ds:
  type: pandas.CSVDataSet
  filepath: "${path:data/01_raw/thing.csv}"
"""
Notes:
@SajidAlamQB and @ankatiyar think that configs should be handled by config_loader because it clarifies the responsibility of reading config.
data_source, root etc. in ConfigLoader, and resolve it automatically with `config_loader["catalog"]`
@deepyaman, @astrojuanlu, @noklam and @amandakys favor changing DataCatalog.from_config(new_argument)
data_source or root
DataCatalog.from_config(file_parameters=["filepath", "path"])
config_loader, user will be forced to use config loader with DataCatalog.
The decision is to change DataCatalog.from_config; we haven't decided what the name of the argument should be.
Closing as the exercise is finished and we identified a few key issues. Relevant comments are spread into these issues: https://github.com/kedro-org/kedro/issues/1460, https://github.com/kedro-org/kedro/issues/2819#issuecomment-1669771871
Description
This follows the insights from #2901; one of the issues flagged is how we make it easier to use Kedro with existing projects. This task focuses on putting yourself in the shoes of our users as they refactor their work to adopt Kedro.
Possible Implementation
The scope of this task involves taking this project and converting it into a project which uses Kedro as a:
You can work with the post authors, Daisy Wood and Mitchell West. Mitchell additionally added: "Happy to work with the Kedro team for guidance getting our approach up and running if it would be helpful? Next steps for this approach is to hopefully create a standalone python package, so would be great to see what a Kedro implementation would look like."
Outcomes
Share your learnings on what steps you took and what the process was like. It would be great to see a process diagram and possible workflow improvements that could be made.