kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.84k stars, 894 forks

Make config loading consistently happen before pipelines are registered to allow for dynamic pipelines with OmegaConf #3093

Open Lasica opened 11 months ago

Lasica commented 11 months ago

Description

Currently, the order in which pipelines and config are loaded varies depending on the kedro command. If pipelines are ever to be dynamic, depending on config or params, then config should always be read before pipelines are registered. Examples:

I added hello-world print statements to the pipeline registry function and to the config resolver registration function to demonstrate the loading order:

✅ Command `kedro catalog list`

```bash
adobrogo@gidpod ..multirunner-demo/spaceflights-multirun (git)-[main] % kedro catalog list
Hello config register resolver
Hello pipeline registry function
```

❌ Command `kedro run`

```bash
adobrogo@gidpod ..multirunner-demo/spaceflights-multirun (git)-[main] % kedro run --namespace price_predictor.base --nodes price_predictor.base.debug_node
[09/28/23 14:59:54] INFO Kedro project spaceflights-multirun session.py:364
Hello pipeline registry function
Hello config register resolver
[09/28/23 14:59:55] INFO Loading data from 'params:price_predictor.base.model_options' (MemoryDataset)... data_catalog.py:492
INFO Running node: debug_node: verbose_params([params:price_predictor.base.model_options]) -> None node.py:331
INFO Verbose debug node reporting nodes.py:60
INFO Argument number:0, Value:{'test_size': 0.2, 'random_state': 3, 'target': 'price', 'features': ['engines', 'passenger_capacity', 'crew', nodes.py:62
'd_check_complete'], 'model': 'sklearn.linear_model.LinearRegression', 'model_params': {'gamma': 3}}
INFO Completed 1 out of 1 tasks sequential_runner.py:85
INFO Pipeline execution completed successfully.
```

❌ Command `kedro registry list`

```bash
adobrogo@gidpod ..multirunner-demo/spaceflights-multirun (git)-[main] % kedro registry list
Hello pipeline registry function
- __default__
- data_processing
- data_science
```

✅ Command `kedro ipython`

```bash
adobrogo@gidpod ..multirunner-demo/spaceflights-multirun (git)-[main] % kedro ipython
ipython --ext kedro.ipython
Python 3.10.4 (main, Apr 6 2023, 13:50:45) [GCC 12.2.1 20230111]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.15.0 -- An enhanced Interactive Python. Type '?' for help.
[09/28/23 15:01:51] INFO Resolved project path as: /home/adobrogo/projects/kedro/multirunner-demo/spaceflights-multirun. __init__.py:139
To set a different path, run '%reload_kedro '
Hello config register resolver
[09/28/23 15:01:52] INFO Kedro project spaceflights-multirun __init__.py:108
INFO Defined global variable 'context', 'session', 'catalog' and 'pipelines' __init__.py:109
[09/28/23 15:01:53] INFO Registered line magic 'run_viz' __init__.py:115

In [1]:
```

I think that's all the places that read the pipelines from kedro.framework.project.

Context

I wrote an example based on the Spaceflights starter that uses the modular pipelines feature and an OmegaConf resolver to create pseudo-dynamic pipelines (I will soon publish this for review):

    pipes = []
    for family, variants in MODEL_FAMILIES.items():
        for model_variant in variants:
            pipes.append(
                pipeline(
                    data_science_pipeline,
                    inputs={"model_input_table": "model_input_table"},
                    namespace=f"{family}.{model_variant}",
                    tags=[model_variant]
                )
            )

Currently, MODEL_FAMILIES is a static variable that is validated against what is defined in the parameters configuration, for example:

Configuration sample

```yaml
model_options:
  test_size: 0.2
  random_state: 3
  target: price
  features:
    - engines
    - passenger_capacity
    - crew
    - d_check_complete
    - moon_clearance_complete
    - iata_approved
    - company_rating
    - review_scores_rating
  # unused, it's defined for demo purposes
  model: sklearn.linear_model.LinearRegression
  model_params: {}

# model family
price_predictor:
  _overrides:
    model_params:
      gamma: 3
    features:
      - engines
      - passenger_capacity
      - crew
      - d_check_complete
  model_options: ${merge:${model_options},${._overrides}}

  # model variants
  base:
    model_options: ${..model_options}

  candidate1:
    model_options: ${merge:${..model_options},${._overrides}}
    _overrides:
      features:
        - engines
        - passenger_capacity
        - crew
        - d_check_complete
        - company_rating

  candidate2:
    model_options: ${merge:${..model_options},${._overrides}}
    _overrides:
      model_params:
        gamma: 2.5

  candidate3:
    model_options: ${..model_options}

model_families: ${register_model_families:}
```
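To make the `merge` interpolation in this sample easier to follow, here is a minimal plain-Python sketch of the override behaviour it appears to rely on. It is an approximation, not the actual resolver from the demo project and not OmegaConf's own merge implementation: the function name `merge` and the variable names are illustrative, and it assumes nested mappings are merged key by key while lists are replaced wholesale. That assumption matches the `price_predictor.base` value printed in the `kedro run` log.

```python
from copy import deepcopy


def merge(base: dict, overrides: dict) -> dict:
    """Recursively merge `overrides` into a copy of `base`.

    Nested dicts are merged key by key; any other value (lists included)
    is replaced wholesale by the override. Illustrative helper only.
    """
    result = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# Top-level defaults, copied from the configuration sample.
model_options = {
    "test_size": 0.2,
    "random_state": 3,
    "target": "price",
    "features": [
        "engines",
        "passenger_capacity",
        "crew",
        "d_check_complete",
        "moon_clearance_complete",
        "iata_approved",
        "company_rating",
        "review_scores_rating",
    ],
    "model": "sklearn.linear_model.LinearRegression",
    "model_params": {},
}

# Family-level overrides (price_predictor._overrides in the sample).
family_overrides = {
    "model_params": {"gamma": 3},
    "features": ["engines", "passenger_capacity", "crew", "d_check_complete"],
}

# price_predictor.model_options, which `base` and `candidate3` reuse as-is.
family_options = merge(model_options, family_overrides)

# candidate2 layers its own override on top of the family options.
candidate2_options = merge(family_options, {"model_params": {"gamma": 2.5}})

print(family_options["model_params"])      # {'gamma': 3}
print(candidate2_options["model_params"])  # {'gamma': 2.5}
```

Each variant therefore only needs to state what differs from its family, and the family only what differs from the top-level defaults.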

It would be fully dynamic if the loading order were consistent: a custom OmegaConf resolver could then populate MODEL_FAMILIES, which defines the namespaces of the modular pipelines shown above. I believe this is the minimum-effort change needed to achieve dynamic pipelines functionality.
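The order dependence can be illustrated with a small stand-alone simulation. Nothing here is Kedro or OmegaConf API: `load_config`, `register_pipelines` and the shape of `params` are hypothetical stand-ins for "config loading runs the resolver" and "the pipeline registry reads MODEL_FAMILIES".

```python
# Hypothetical stand-ins: none of these names are Kedro APIs.
MODEL_FAMILIES: dict[str, list[str]] = {}


def load_config(params: dict) -> None:
    """Stand-in for config loading, where a custom resolver would run
    and populate MODEL_FAMILIES as a side effect."""
    for family, variants in params.get("model_families", {}).items():
        MODEL_FAMILIES[family] = list(variants)


def register_pipelines() -> list[str]:
    """Stand-in for the pipeline registry: one namespace per variant."""
    return [
        f"{family}.{variant}"
        for family, variants in MODEL_FAMILIES.items()
        for variant in variants
    ]


params = {
    "model_families": {
        "price_predictor": ["base", "candidate1", "candidate2", "candidate3"]
    }
}

# Order observed with `kedro run`: pipelines first, config second.
MODEL_FAMILIES.clear()
pipelines_first = register_pipelines()
load_config(params)

# Order observed with `kedro catalog list`: config first, pipelines second.
MODEL_FAMILIES.clear()
load_config(params)
config_first = register_pipelines()

print(pipelines_first)  # []
print(config_first)
```

With pipelines registered first, the registry sees an empty MODEL_FAMILIES and produces no namespaces; with config loaded first, all four variants are registered.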

Possible Implementation

Because many Kedro resources are evaluated lazily, for KedroSession it's sufficient to load the catalog before reading the pipeline to run. The two steps don't depend on each other, so they can simply be swapped.

For `kedro registry`, some config read could be introduced.

However, for consistency across all entry points, the _ProjectPipelines class from kedro.framework.project should refer to the config in some way so that it actually gets loaded.
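One possible shape for that last idea, sketched as a self-contained class rather than a patch to Kedro: a lazily loaded pipelines container that triggers a config-loading hook before it calls the registry for the first time. `LazyPipelines` and both callables are hypothetical names chosen for illustration; this is not how `_ProjectPipelines` is implemented.

```python
from typing import Callable, Optional


class LazyPipelines:
    """Hypothetical sketch, not Kedro's `_ProjectPipelines`.

    Pipelines are registered on first access, and config is always
    loaded immediately before that, whichever command triggers it.
    """

    def __init__(
        self,
        load_config: Callable[[], None],
        register: Callable[[], dict],
    ) -> None:
        self._load_config = load_config
        self._register = register
        self._content: Optional[dict] = None

    def _load(self) -> dict:
        if self._content is None:
            self._load_config()  # config (and its resolvers) first
            self._content = self._register()  # then the pipeline registry
        return self._content

    def __getitem__(self, name: str):
        return self._load()[name]

    def keys(self):
        return self._load().keys()


calls: list[str] = []


def fake_load_config() -> None:
    calls.append("config")


def fake_register() -> dict:
    calls.append("pipelines")
    return {"__default__": "placeholder pipeline"}


pipelines = LazyPipelines(fake_load_config, fake_register)
calls_before_access = list(calls)  # nothing has been loaded yet

names = list(pipelines.keys())  # first access triggers both steps, in order
_ = pipelines["__default__"]    # later accesses reuse the cached result

print(calls)  # ['config', 'pipelines']
```

Putting the ordering guarantee inside the container means individual commands would not each have to remember to load config first.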

Related issues:

noklam commented 11 months ago

Minor GitHub hack: list the issues as bullet points and their titles will render automatically, i.e.

- #3000
- #2663
- #2627
- #2626

datajoely commented 7 months ago

Great push - thanks for the well-thought-out issue, @Lasica!