## Description

Currently the order of pipeline loading vs. config loading varies depending on the kedro command. If pipelines are ever to be dynamic based on config/params, then config should always be loaded before pipelines are registered. Examples:

I added hello-world print statements to the pipeline registry and to the config handler functions to demonstrate the loading order:
✅ Command `kedro catalog list`
```bash
adobrogo@gidpod ..multirunner-demo/spaceflights-multirun (git)-[main] % kedro catalog list
Hello config register resolver
Hello pipeline registry function
```
❌ Command `kedro run`
```bash
adobrogo@gidpod ..multirunner-demo/spaceflights-multirun (git)-[main] % kedro run --namespace price_predictor.base --nodes price_predictor.base.debug_node
[09/28/23 14:59:54] INFO Kedro project spaceflights-multirun session.py:364
Hello pipeline registry function
Hello config register resolver
[09/28/23 14:59:55] INFO Loading data from 'params:price_predictor.base.model_options' (MemoryDataset)... data_catalog.py:492
INFO Running node: debug_node: verbose_params([params:price_predictor.base.model_options]) -> None node.py:331
INFO Verbose debug node reporting nodes.py:60
INFO Argument number:0, Value:{'test_size': 0.2, 'random_state': 3, 'target': 'price', 'features': ['engines', 'passenger_capacity', 'crew', nodes.py:62
'd_check_complete'], 'model': 'sklearn.linear_model.LinearRegression', 'model_params': {'gamma': 3}}
INFO Completed 1 out of 1 tasks sequential_runner.py:85
INFO Pipeline execution completed successfully.
```
❌ Command `kedro registry list`
```bash
adobrogo@gidpod ..multirunner-demo/spaceflights-multirun (git)-[main] % kedro registry list
Hello pipeline registry function
- __default__
- data_processing
- data_science
```
✅ Command `kedro ipython`
```bash
adobrogo@gidpod ..multirunner-demo/spaceflights-multirun (git)-[main] % kedro ipython
ipython --ext kedro.ipython
Python 3.10.4 (main, Apr 6 2023, 13:50:45) [GCC 12.2.1 20230111]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.15.0 -- An enhanced Interactive Python. Type '?' for help.
[09/28/23 15:01:51] INFO Resolved project path as: /home/adobrogo/projects/kedro/multirunner-demo/spaceflights-multirun. __init__.py:139
To set a different path, run '%reload_kedro '
Hello config register resolver
[09/28/23 15:01:52] INFO Kedro project spaceflights-multirun __init__.py:108
INFO Defined global variable 'context', 'session', 'catalog' and 'pipelines' __init__.py:109
[09/28/23 15:01:53] INFO Registered line magic 'run_viz' __init__.py:115
In [1]:
```
I think that's all the places that read the pipelines from `kedro.framework.project`.
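To make the dependency concrete, here is a stdlib-only Python sketch of the failure mode (all names are hypothetical stand-ins, no Kedro imports): if the registry runs before config loading, dynamically defined pipelines are silently missing.

```python
# Stand-ins for the real Kedro machinery; names are hypothetical.
MODEL_FAMILIES: dict[str, list[str]] = {}

def load_config() -> None:
    """Stand-in for config loading / omegaconf resolver registration."""
    MODEL_FAMILIES.update({"price_predictor": ["base", "candidate1"]})

def register_pipelines() -> dict[str, list[str]]:
    """Stand-in for the pipeline registry; depends on MODEL_FAMILIES."""
    return {
        f"{family}.{variant}": [f"{family}.{variant}.train_node"]
        for family, variants in MODEL_FAMILIES.items()
        for variant in variants
    }

# Order used by `kedro run` / `kedro registry list`: registry first.
pipelines_wrong = register_pipelines()   # empty - dynamic pipelines missing

load_config()

# Order used by `kedro catalog list`: config first, registry second.
pipelines_right = register_pipelines()   # two namespaced pipelines

print(len(pipelines_wrong), len(pipelines_right))  # 0 2
```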
## Context
I wrote an example based on the Spaceflights starter that uses the modular pipelines feature and an omegaconf resolver to create pseudo-dynamic pipelines (I will publish this for review soon):

```python
from kedro.pipeline import pipeline  # modular pipeline factory

pipes = []
for family, variants in MODEL_FAMILIES.items():
    for model_variant in variants:
        pipes.append(
            pipeline(
                data_science_pipeline,
                inputs={"model_input_table": "model_input_table"},
                namespace=f"{family}.{model_variant}",
                tags=[model_variant],
            )
        )
```
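As an aside, the `${merge:...}` interpolations in the parameters configuration assume a custom resolver with deep-merge semantics. A stdlib sketch of those semantics (hypothetical implementation; the real resolver would be registered with OmegaConf and operate on config objects):

```python
def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a new dict where `overrides` wins; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # scalars and lists are replaced, not merged
    return merged

model_options = {
    "target": "price",
    "features": ["engines", "passenger_capacity", "crew"],
    "model_params": {},
}
overrides = {"model_params": {"gamma": 3}, "features": ["engines", "crew"]}

merged = deep_merge(model_options, overrides)
print(merged["model_params"])  # {'gamma': 3}
print(merged["features"])      # ['engines', 'crew']
print(merged["target"])        # 'price'
```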
Currently `MODEL_FAMILIES` is a static variable that is validated against what is defined in parameters. The configuration looks like this:

```yaml
model_options:
  test_size: 0.2
  random_state: 3
  target: price
  features:
    - engines
    - passenger_capacity
    - crew
    - d_check_complete
    - moon_clearance_complete
    - iata_approved
    - company_rating
    - review_scores_rating  # unused, it's defined for demo purposes
  model: sklearn.linear_model.LinearRegression
  model_params: {}

# model family
price_predictor:
  _overrides:
    model_params:
      gamma: 3
    features:
      - engines
      - passenger_capacity
      - crew
      - d_check_complete
  model_options: ${merge:${model_options},${._overrides}}

  # model variants
  base:
    model_options: ${..model_options}
  candidate1:
    model_options: ${merge:${..model_options},${._overrides}}
    _overrides:
      features:
        - engines
        - passenger_capacity
        - crew
        - d_check_complete
        - company_rating
  candidate2:
    model_options: ${merge:${..model_options},${._overrides}}
    _overrides:
      model_params:
        gamma: 2.5
  candidate3:
    model_options: ${..model_options}

model_families: ${register_model_families:}
```

It would be fully dynamic if the loading order were consistent: a custom omegaconf resolver could then populate the `MODEL_FAMILIES` variable that defines the namespaces of the modular pipelines shown above. I believe this is a minimum-effort change to achieve dynamic pipelines functionality.

## Possible Implementation

Due to the lazy evaluation of many Kedro resources, for `KedroSession` it is sufficient to load the catalog before reading the pipeline to run. The two steps don't depend on each other, so they can simply be swapped.

For `kedro registry`, some kind of config read could be introduced.

However, for better consistency in all places, the `_ProjectPipelines` class from `kedro.framework.project` should refer to the config in some way to force it to actually load.
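The `_ProjectPipelines` consistency idea could be sketched (stdlib only, all names hypothetical) as a lazy mapping that forces config loading the first time pipelines are accessed, so the order no longer depends on which CLI command runs first:

```python
from collections.abc import Mapping

class LazyPipelines(Mapping):
    """Hypothetical sketch: defers registration and always loads config first."""

    def __init__(self, load_config, register_pipelines):
        self._load_config = load_config
        self._register = register_pipelines
        self._pipelines = None

    def _materialize(self):
        if self._pipelines is None:
            self._load_config()               # config/resolvers run first
            self._pipelines = self._register()
        return self._pipelines

    def __getitem__(self, key):
        return self._materialize()[key]

    def __iter__(self):
        return iter(self._materialize())

    def __len__(self):
        return len(self._materialize())

# Usage: record the call order to show config always precedes the registry.
events = []
pipes = LazyPipelines(
    load_config=lambda: events.append("config"),
    register_pipelines=lambda: (events.append("registry"), {"__default__": []})[-1],
)
list(pipes)     # first access materializes the pipelines
print(events)   # ['config', 'registry']
```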
## Related issues

- #3000
- #2663
- #2627
- #2626