kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

Define modular pipelines with config file #713

Open bpmeek opened 3 weeks ago

bpmeek commented 3 weeks ago

Description

As a Kedro user I have always wanted to be able to define modular pipelines in a config file.

Context

I believe that doing so will reduce the likelihood of a user inadvertently impacting a pipeline other than the one intended when making changes to pipeline_registry.

Possible Implementation

I opened PR #3904 in the Kedro core repository, and @datajoely suggested it might make more sense as a plugin than as part of Kedro core. I'm unsure which directory of this repository it would belong in.

Possible Alternatives

@datajoely mentioned a possible alternative here

astrojuanlu commented 3 weeks ago

As a Kedro user I have always wanted to be able to define modular pipelines in a config file.

And you're not alone!

Your data team will go through this cycle, sorry I don’t make the rules

First of all, this doesn't have to start its life in kedro-org/kedro-plugins. I encourage you to create your plugin in your personal account, see for example https://github.com/astrojuanlu/kedro-init or https://github.com/noklam/kedro-softfail-runner. Happy to keep this issue open to help you make progress on that.

We don't have a plugin template yet (see https://github.com/kedro-org/kedro/issues/2685), but to create the package you can start with https://github.com/astrojuanlu/copier-pylib (shameless self-plug) and take it from there.

This is just one idea of what the developer experience could look like:

$ kedro new ... && cd my_project && uv venv && source .venv/bin/activate  # assume (.venv) in all prompts
$ uv pip install -r requirements.txt
$ uv pip install kedro-yaml-pipelines  # Your plugin
$ kedro yaml-pipeline create data_processing  # Not particularly beautiful, open to ideas here
$ tree src/my_project/pipelines
src/my_project/pipelines/
├── __init__.py
├── data_processing
│   ├── __init__.py
│   ├── nodes.py
│   └── pipeline.yaml  # <------- The YAML definition!
$ # Or, alternatively, YAML pipelines are defined in a central location?
$ tree src/my_project
src/my_project/
├── __init__.py
├── __main__.py
├── hooks.py
├── pipeline_registry.py
├── pipelines
│   ├── __init__.py
│   ├── pipelines.yaml  # <---- All pipelines are defined here?
│   ├── data_processing.py  # <---- Maybe even no need for `nodes.py`?
$ # Make edits to nodes, YAML definition
$ kedro run
...
# Everything works as usual!
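
To make the per-pipeline `pipeline.yaml` idea a bit more concrete, here is a minimal sketch of how a parsed definition could be turned into node specifications. Everything in it is an assumption for illustration: the schema (`nodes`, `func`, `inputs`, `outputs`, `name`), the `NodeSpec` stand-in, and the `build_nodes` helper are hypothetical, and a real plugin would presumably parse the file with a YAML library and build Kedro node objects instead.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class NodeSpec:
    """Stand-in for a Kedro node so the sketch runs without Kedro installed."""

    func: Callable[..., Any]
    inputs: List[str]
    outputs: List[str]
    name: str


def build_nodes(
    definition: Dict[str, Any],
    functions: Dict[str, Callable[..., Any]],
) -> List[NodeSpec]:
    """Resolve each entry of a parsed pipeline definition to a NodeSpec.

    `definition` is the dictionary a YAML parser would return for a
    hypothetical pipeline.yaml; `functions` maps the names used in that
    file to the callables defined in nodes.py.
    """
    nodes = []
    for entry in definition["nodes"]:
        func_name = entry["func"]
        if func_name not in functions:
            raise KeyError(f"Unknown node function: {func_name!r}")
        nodes.append(
            NodeSpec(
                func=functions[func_name],
                inputs=list(entry.get("inputs", [])),
                outputs=list(entry.get("outputs", [])),
                # Fall back to the function name when no node name is given.
                name=entry.get("name", func_name),
            )
        )
    return nodes


def clean(raw):
    """Example node function: drop missing rows."""
    return [row for row in raw if row is not None]


# Hypothetical parsed content of data_processing/pipeline.yaml.
definition = {
    "nodes": [
        {
            "func": "clean",
            "inputs": ["raw_data"],
            "outputs": ["clean_data"],
            "name": "clean_node",
        }
    ]
}

nodes = build_nodes(definition, {"clean": clean})
```

Failing early on an unknown function name keeps a typo in the YAML file from surfacing only at run time.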

Now, the only blocker I see off the top of my head is the pipeline_registry.py. Maybe you could tell the user to slightly modify it as follows:

 from kedro.framework.project import find_pipelines
 from kedro.pipeline import Pipeline

+from kedro_yaml_pipelines.registry import find_pipelines as find_yaml_pipelines
+

 def register_pipelines() -> Dict[str, Pipeline]:
     """Register the project's pipelines.
@@ -12,5 +14,6 @@ def register_pipelines() -> Dict[str, Pipeline]:
         A mapping from pipeline names to ``Pipeline`` objects.
     """
     pipelines = find_pipelines()
+    pipelines.update(find_yaml_pipelines())  # dicts do not support +=
     pipelines["__default__"] = sum(pipelines.values())
     return pipelines

and hopefully this should be it.
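
As a rough sketch of what could sit behind such a `find_pipelines` function in the plugin, the snippet below discovers one definition file per pipeline folder and merges two registries. It is not an existing API: `find_yaml_pipeline_files`, `merge_registries`, and the `pipeline.yaml` file name are assumptions taken from the layout suggested above, and the sketch stops at locating files rather than building `Pipeline` objects.

```python
from pathlib import Path
from typing import Any, Dict


def find_yaml_pipeline_files(
    pipelines_dir: Path, filename: str = "pipeline.yaml"
) -> Dict[str, Path]:
    """Map each pipeline folder name to its definition file, if it has one."""
    found = {}
    for candidate in sorted(pipelines_dir.glob(f"*/{filename}")):
        # The parent folder name doubles as the pipeline name.
        found[candidate.parent.name] = candidate
    return found


def merge_registries(
    python_pipelines: Dict[str, Any], yaml_pipelines: Dict[str, Any]
) -> Dict[str, Any]:
    """Combine two registries, refusing to silently overwrite a pipeline."""
    clashes = python_pipelines.keys() & yaml_pipelines.keys()
    if clashes:
        raise ValueError(f"Pipelines defined twice: {sorted(clashes)}")
    merged = dict(python_pipelines)
    merged.update(yaml_pipelines)
    return merged
```

Raising on a name clash is one possible design choice; a plugin could equally let YAML definitions take precedence, as long as the behaviour is documented.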

Of course there are lots of variations for this. The key points are:

  1. This is your plugin. So feel free to tailor the DX to your needs. Don't let us dictate how it should be - my suggestions above are just suggestions.
  2. We are happy to guide you on how to create such a plugin. Whether or not that becomes official is another story - but if we see enough traction, I think we should seriously consider it!

astrojuanlu commented 3 weeks ago

Notice that this is a departure from your original request - that pipelines.yml live under conf/. This is based on the opinion I stated in https://github.com/kedro-org/kedro/pull/3904#issuecomment-2149024266 - but again, your plugin, your rules :)