Open bpmeek opened 3 weeks ago
As a Kedro user I have always wanted to be able to define modular pipelines in a config file.
And you're not alone!
Your data team will go through this cycle, sorry I don’t make the rules
First of all, this doesn't have to start its life in kedro-org/kedro-plugins. I encourage you to create your plugin in your personal account, see for example https://github.com/astrojuanlu/kedro-init or https://github.com/noklam/kedro-softfail-runner. Happy to keep this issue open to help you make progress on that.
To create the package, we don't have a plugin template yet (see https://github.com/kedro-org/kedro/issues/2685), but you can start with https://github.com/astrojuanlu/copier-pylib (shameless self-plug) and take it from there.
This is just one idea of what the developer experience could look like:
$ kedro new ... && cd my_project && uv venv && source .venv/bin/activate # assume (.venv) in all prompts
$ uv pip install -r requirements.txt
$ uv pip install kedro-yaml-pipelines # Your plugin
$ kedro yaml-pipeline create data_processing # Not particularly beautiful, open to ideas here
$ tree src/my_project/pipelines
src/my_project/pipelines/
├── __init__.py
├── data_processing
│ ├── __init__.py
│ ├── nodes.py
│ └── pipeline.yaml # <------- The YAML definition!
$ # Or, alternatively, YAML pipelines are defined in a central location?
$ tree src/my_project
src/my_project/
├── __init__.py
├── __main__.py
├── hooks.py
├── pipeline_registry.py
├── pipelines
│ ├── __init__.py
│ ├── pipelines.yaml # <---- All pipelines are defined here?
│ ├── data_processing.py # <---- Maybe even no need for `nodes.py`?
$ # Make edits to nodes, YAML definition
$ kedro run
...
# Everything works as usual!
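To make the YAML idea a bit more concrete, here is a minimal sketch of how a plugin might turn node entries into Python callables. Everything in it is an assumption for illustration: the `nodes`/`func`/`inputs`/`outputs`/`name` keys, the `NodeSpec` class and both helper functions are hypothetical, not an existing Kedro or plugin API, and the dict stands in for whatever a YAML loader would return for pipeline.yaml. Standard-library functions play the role of the functions in nodes.py so the snippet runs on its own.

```python
import importlib
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical in-memory form of one node entry from pipeline.yaml."""

    func: Callable[..., Any]
    inputs: List[str]
    outputs: List[str]
    name: str


def resolve_callable(dotted_path: str) -> Callable[..., Any]:
    """Import 'package.module.function' and return the function object."""
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.function', got {dotted_path!r}")
    obj = getattr(importlib.import_module(module_name), attr)
    if not callable(obj):
        raise TypeError(f"{dotted_path!r} does not point to a callable")
    return obj


def parse_pipeline_definition(definition: Dict[str, Any]) -> List[NodeSpec]:
    """Turn the parsed (assumed) YAML structure into a list of NodeSpec objects."""
    specs = []
    for entry in definition["nodes"]:
        dotted_path = entry["func"]
        specs.append(
            NodeSpec(
                func=resolve_callable(dotted_path),
                inputs=list(entry.get("inputs", [])),
                outputs=list(entry.get("outputs", [])),
                # Fall back to the function name when no node name is given.
                name=entry.get("name", dotted_path.rpartition(".")[2]),
            )
        )
    return specs


# Stand-in for the parsed contents of a hypothetical pipeline.yaml.
example_definition = {
    "nodes": [
        {
            "func": "math.sqrt",
            "inputs": ["raw_value"],
            "outputs": ["root_value"],
            "name": "take_root",
        },
        {"func": "statistics.mean", "inputs": ["values"], "outputs": ["average"]},
    ]
}

node_specs = parse_pipeline_definition(example_definition)
```

A real plugin would presumably hand each resolved function, its inputs and its outputs to Kedro's `node()` and wrap the result in a `Pipeline`, rather than stopping at a plain dataclass.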
Now, the only blocker I see off the top of my head is the pipeline_registry.py. Maybe you could tell the user to slightly modify it as follows:
 from kedro.framework.project import find_pipelines
 from kedro.pipeline import Pipeline
+from kedro_yaml_pipelines.registry import find_pipelines as find_yaml_pipelines
+
 def register_pipelines() -> Dict[str, Pipeline]:
     """Register the project's pipelines.
@@ -12,5 +14,6 @@ def register_pipelines() -> Dict[str, Pipeline]:
         A mapping from pipeline names to ``Pipeline`` objects.
     """
     pipelines = find_pipelines()
+    pipelines.update(find_yaml_pipelines())  # merge the two dicts; dicts do not support +=
     pipelines["__default__"] = sum(pipelines.values())
     return pipelines
and hopefully this should be it.
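For what it's worth, the discovery half of such a `find_yaml_pipelines` could be quite small. The sketch below is only one guess at it: `discover_yaml_definitions` and `merge_registries` are hypothetical names, the one-`pipeline.yaml`-per-pipeline-folder layout is the first of the two layouts above, and plain strings stand in for real `Pipeline` objects so that nothing beyond the standard library is needed. The merge uses `dict.update`, since dicts do not implement `+=`, and refuses to silently overwrite a Python-defined pipeline with a YAML one of the same name.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict


def discover_yaml_definitions(
    pipelines_dir: Path, filename: str = "pipeline.yaml"
) -> Dict[str, Path]:
    """Map each pipeline folder name to its YAML definition file, if it has one."""
    found = {}
    for candidate in sorted(pipelines_dir.glob(f"*/{filename}")):
        found[candidate.parent.name] = candidate
    return found


def merge_registries(
    python_pipelines: Dict[str, Any], yaml_pipelines: Dict[str, Any]
) -> Dict[str, Any]:
    """Combine both registries, failing loudly on duplicate pipeline names."""
    clashes = python_pipelines.keys() & yaml_pipelines.keys()
    if clashes:
        raise ValueError(f"Pipelines defined twice: {sorted(clashes)}")
    merged = dict(python_pipelines)
    merged.update(yaml_pipelines)
    return merged


# Build a throwaway project layout to exercise the discovery step.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data_processing").mkdir()
    (root / "data_processing" / "pipeline.yaml").write_text("nodes: []\n")
    (root / "data_science").mkdir()  # Python-only pipeline, no YAML file
    (root / "data_science" / "nodes.py").write_text("")
    discovered_names = sorted(discover_yaml_definitions(root))

merged = merge_registries(
    {"data_science": "python-defined"},
    {name: "yaml-defined" for name in discovered_names},
)
```

Whether a name clash should raise, warn, or let one side win is a design choice for the plugin; raising is just the most conservative default for a sketch.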
Of course there are lots of variations for this. The key points are
Notice that this is a departure from your original request - that pipelines.yml live under conf/. This is based on the opinion I stated in https://github.com/kedro-org/kedro/pull/3904#issuecomment-2149024266 - but again, your plugin, your rules :)
Description
As a Kedro user I have always wanted to be able to define modular pipelines in a config file.
Context
I believe that doing so will reduce the likelihood of a user inadvertently impacting a pipeline other than the one intended when making changes to pipeline_registry.
Possible Implementation
I created PR #3904 in the Kedro core repository and @datajoely mentioned it might make more sense to make this a plugin rather than part of Kedro Core, but I'm unsure which directory it would belong in.
Possible Alternatives
@datajoely mentioned a possible alternative here