kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

The pipeline registry is difficult to understand #3233

Open astrojuanlu opened 1 year ago

astrojuanlu commented 1 year ago

The current code for pipeline_registry.py in the default template is as follows:

https://github.com/kedro-org/kedro/blob/df9f174864640de193b2b85f04d0c3e8aee7d22c/kedro/templates/project/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/pipeline_registry.py#L1-L16

Apart from #2526, this is fine and works well. The magic is in kedro.framework.project.find_pipelines, which scans different directories searching for a create_pipeline function:

https://github.com/kedro-org/kedro/blob/df9f174864640de193b2b85f04d0c3e8aee7d22c/kedro/framework/project/__init__.py#L292
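To make the "magic" concrete, here is a minimal sketch of name-based discovery. It is not Kedro's actual implementation: `discover_pipelines`, the throwaway `demo_project` package, and the lists of strings standing in for `Pipeline` objects are all assumptions for illustration. It only shows the general idea of importing each subpackage under `<package>.pipelines` and calling a function named `create_pipeline` if one exists.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def discover_pipelines(package_name: str) -> dict:
    """Call create_pipeline() in every subpackage of <package_name>.pipelines."""
    pipelines_pkg = importlib.import_module(f"{package_name}.pipelines")
    found = {}
    for path in sorted(Path(pipelines_pkg.__file__).parent.iterdir()):
        # Skip plain files as well as hidden and private directories.
        if not path.is_dir() or path.name.startswith((".", "_")):
            continue
        module = importlib.import_module(f"{package_name}.pipelines.{path.name}")
        factory = getattr(module, "create_pipeline", None)
        if callable(factory):
            found[path.name] = factory()
    return found


# Build a throwaway package on disk to exercise the sketch (hypothetical names).
root = Path(tempfile.mkdtemp())
pkg = root / "demo_project" / "pipelines"
for name, func in [
    ("data_science", "create_pipeline"),
    ("evaluation", "create_eval_pipeline"),  # deliberately not the expected name
]:
    (pkg / name).mkdir(parents=True)
    (pkg / name / "__init__.py").write_text(
        f"def {func}():\n    return ['{name}_node']\n"
    )
(root / "demo_project" / "__init__.py").write_text("")
(pkg / "__init__.py").write_text("")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

discovered = discover_pipelines("demo_project")
print(discovered)
```

In this sketch, `evaluation` is silently left out because its factory is not called `create_pipeline`, which is the kind of surprise a purely name-based lookup can produce.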

This is so magical, though, that the moment users want to register pipelines manually, they get lost. For example, one user was trying something like kedro run --pipeline=data_science+evaluation (which is a beautiful syntax, by the way): https://linen-slack.kedro.org/t/15697047/i-have-a-quick-question-on-running-selected-pipelines-only-i#b93fe172-d54f-4f51-a8a6-b85f9dbcec32

to which I replied, how would I subtract a pipeline?

from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline


def register_pipelines() -> dict[str, Pipeline]:
    """Register the project's pipelines.

    Returns:
        A mapping from pipeline names to ``Pipeline`` objects.
    """
    pipelines = find_pipelines()
    pipelines["__default__"] = sum(pipelines.values())
    pipelines["except-train"] = ...  # ??? how do I express "everything except training"?
    return pipelines

In the end I did this:

from .pipelines.model_training import create_pipeline as create_model_training_pipeline

...
pipelines["all"] = sum(pipelines.values())
pipelines["all_except_eval"] = pipelines["all"] - create_model_training_pipeline()

but @noklam suggested this instead:

pipelines["all_except_eval"] = pipelines["all"] - pipelines["eval"]
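The arithmetic in these snippets is easier to follow when a pipeline is viewed as a set of nodes: `+` is a union, `-` is a difference, and `sum()` works because adding to the integer 0 is special-cased. The sketch below uses a hypothetical `ToyPipeline` stand-in (not Kedro's `Pipeline` class) with made-up node names, so it runs without Kedro installed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPipeline:
    """Hypothetical stand-in for a pipeline: just a set of node names."""

    nodes: frozenset

    def __add__(self, other: ToyPipeline) -> ToyPipeline:
        # Combining two pipelines keeps every node from both.
        return ToyPipeline(self.nodes | other.nodes)

    def __radd__(self, other):
        # sum() starts from the integer 0, so 0 + pipeline returns the pipeline.
        if other == 0:
            return self
        return NotImplemented

    def __sub__(self, other: ToyPipeline) -> ToyPipeline:
        # Subtracting removes the nodes that belong to the other pipeline.
        return ToyPipeline(self.nodes - other.nodes)


# Stand-in for what a discovery step might return (names are illustrative).
pipelines = {
    "data_processing": ToyPipeline(frozenset({"clean", "split"})),
    "train": ToyPipeline(frozenset({"fit_model"})),
    "eval": ToyPipeline(frozenset({"score_model"})),
}

pipelines["all"] = sum(pipelines.values())
pipelines["all_except_eval"] = pipelines["all"] - pipelines["eval"]

print(sorted(pipelines["all_except_eval"].nodes))
```

Reusing an entry that is already in the dictionary, as in the last assignment, avoids importing and calling the individual `create_pipeline` function a second time.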

This week I saw a user do something similar, but they renamed the functions instead:

https://github.com/pablovdcf/TFM_HADO_Cares/blob/28d5a024b915169a039a5a84996b9ee11ee1f3ee/hado/src/hado/pipeline_registry.py#L5-L7

and since their pipeline creation functions were not named create_pipeline but something else, this completely broke the automagic find_pipelines for them.

astrojuanlu commented 1 year ago

Another pattern: repeatedly using create_pipeline https://linen-slack.kedro.org/t/16062967/i-think-this-might-be-a-versioning-question-i-created-a-kedr#e92a5668-7e06-41e7-8825-3ec18fff1c0c

from typing import Dict

from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline

from network_anomaly_detection.pipelines import (
    data_collection as dc,
    data_engineering as de,
    ...
)

def register_pipelines() -> Dict[str, Pipeline]:
    ...
    data_collection_pipeline = dc.create_pipeline()
    data_engineering_pipeline = de.create_pipeline()
    ...

    return {
        "dc": data_collection_pipeline,
        "de": data_engineering_pipeline,
        ...
        "__default__": data_collection_pipeline + data_engineering_pipeline + data_science_pipeline + plot_pipeline
    }