kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Define modular pipelines in config #3904

Closed bpmeek closed 1 week ago

bpmeek commented 1 month ago

Description

As a Kedro user I have always wanted to be able to define modular pipelines in a config file. I believe that doing so will reduce the likelihood of a user inadvertently impacting a pipeline other than the one intended.

Development notes

I have adjusted two files: omegaconf_config.py now has "pipelines": ["pipelines.yml"] added to its config_patterns, and kedro/framework/project/__init__.py has an additional function from_config that reads a config entry and builds it into a modular pipeline, which is then returned. A portion of find_pipelines was abstracted into a helper function _get_pipeline_obj for use in both functions. This was tested using make test and manually tested with the default example_pipeline created when starting a new Kedro project.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

datajoely commented 1 month ago

Hi @bpmeek, thank you for this hard work. The topic of what can be declarative and what should be imperative is something the team has wrestled with for a long time.

I think my personal preference is that this should be a plug-in rather than in Kedro core, but I'm keen to get others' perspectives.

merelcht commented 1 month ago

Hi @bpmeek, echoing @datajoely in thanking you for the work you've put into this PR! I think in the future it might be better to open an issue or discussion first to talk about fundamental changes like this.

As Joel mentioned, this is a topic we've talked about a lot and we even had an internal version of Kedro where you could define pipelines and nodes in yml instead of python.

With the solution you're proposing, I'm wondering why you'd like to define modular pipelines in config but not regular pipelines. And what about nodes? I'd also like to understand better what problem this will solve.

bpmeek commented 1 month ago

Really appreciate the recognition of my work, @datajoely and @merelcht. Off the bat, you're 100% right, @merelcht, and normally I would have opened an issue first, as I have in the past. But I was in a unique situation where I had the time to do this work, so I went ahead and did it instead of starting with an issue, since I will most likely be using this either way in the future.

@merelcht, with this solution I'm defining modular pipelines and pipeline chains because of a recent project I was working on that kept growing; my pipeline registry ended up becoming a huge file that was hard to manage. Had I been able to take a couple of the larger, more complex pipelines out and move them into a config file, I think it would've been much easier to work with.

That being said, I actually do really like the idea of also defining regular pipelines in a config file, as I think this could simplify the problem of "I already have this pipeline but I need to skip this node".

@datajoely I am amenable to the idea of having this in a plug-in instead of kedro core, especially considering I couldn't come up with a simpler way to access pipelines.yml

datajoely commented 1 month ago

That being said, I actually do really like the idea of also defining regular pipelines in a config file, as I think this could simplify the problem of "I already have this pipeline but I need to skip this node".

Now this is interesting. The canonical Kedro way of doing this is to use pipeline algebra:


def register_pipelines() -> Dict[str, Pipeline]:
    return {
        "my_pipeline": pipeline_1,
        "my_pipeline2": pipeline_1 - nodes_to_exclude,
    }

It's not super clear from the PR what your YAML should look like. What do you imagine this looking like?

bpmeek commented 1 month ago

@datajoely thanks, I wasn't aware it was something that simple!

It's not super clear from the PR what your YAML should look like. What do you imagine this looking like?

I have a really basic example in the docstring for from_config, but if your modular pipeline looked like:

modular_pipeline = pipeline(
    pipe=data_processing,
    inputs={"input_a": "input_b"},
    namespace="modular_example",
)

then your pipelines.yml would look like:

<name_doesn't_matter>:
  pipe:
    - data_processing
  inputs:
    input_a: input_b
  namespace: modular_example  
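
For illustration, here's a minimal sketch of how one such entry could be translated into keyword arguments for kedro.pipeline.pipeline(). The helper name entry_to_pipeline_kwargs, the registry argument, and the exact schema are my assumptions based on the example above, not the actual PR implementation:

```python
# Hypothetical sketch: turn one parsed pipelines.yml entry into the
# keyword arguments kedro.pipeline.pipeline() would receive.
# The schema (pipe / inputs / namespace) mirrors the YAML example above;
# entry_to_pipeline_kwargs and `registry` are illustrative names only.

def entry_to_pipeline_kwargs(entry: dict, registry: dict) -> dict:
    """Map a YAML entry onto pipeline() kwargs.

    `registry` maps pipeline names used in the YAML to already-built
    Pipeline objects (e.g. the result of find_pipelines()).
    """
    pipes = [registry[name] for name in entry["pipe"]]
    kwargs = {"pipe": pipes[0] if len(pipes) == 1 else pipes}
    if "inputs" in entry:       # optional dataset remapping
        kwargs["inputs"] = dict(entry["inputs"])
    if "namespace" in entry:    # optional namespace prefix
        kwargs["namespace"] = entry["namespace"]
    return kwargs

# The YAML example above, as yaml.safe_load would parse it:
entry = {
    "pipe": ["data_processing"],
    "inputs": {"input_a": "input_b"},
    "namespace": "modular_example",
}
registry = {"data_processing": "<data_processing Pipeline>"}  # stand-in

print(entry_to_pipeline_kwargs(entry, registry))
```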

Also, my two cents: I think it makes sense to add this functionality because, in my mind, changing the input/output of a modular pipeline is conceptually no different from changing an entry in the data catalog. In other words, code changes are changes to the underlying functionality, and config changes are changes to how the pieces of underlying functionality interact with each other.

datajoely commented 1 month ago

Okay, interesting. I think I have a better way that we could do this (@noklam @astrojuanlu would love your view too).

So you can actually configure a run 1:1 from the Kedro CLI with a YAML file, using kedro run --config some_file.yml
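
For reference, such a run config file nests the kedro run options under a top-level run key. Treat the exact keys below as illustrative; the supported set depends on the Kedro version:

```yaml
# some_file.yml, used as: kedro run --config some_file.yml
# Keys under `run:` mirror the `kedro run` CLI options (illustrative).
run:
  pipeline: data_processing
  tags: preprocessing
  from_nodes: split_data_node
```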

In my view, if we were to make the pipeline CLI slicing commands more expressive, then configuring through YAML would improve automatically through the existing mechanism.

astrojuanlu commented 4 weeks ago

Thanks @bpmeek for this contribution!

On the topic of pipelines as YAML, xref #650, https://github.com/kedro-org/kedro/discussions/3872

On the topic of pipeline algebra, I must say this is something most of our users struggle with #3233

Without having looked at the code, I have two comments to make:

  1. I don't think pipelines.yml should be in conf/ at all. Pipeline structure is hardly "configuration" in the way we usually think about it (so much so that we are debating whether even dataset types should be config, see #770). Instead, if we wanted users to define pipelines in YAML, I'd rather see those YAML files shipped with the source:
❯ tree
.
├── README.md
├── conf
│   ├── README.md
│   ├── base
│   ├── local
│   └── logging.yml
...
├── src
│   ├── package_name
│   │   ├── __init__.py
│   │   ├── __main__.py
│   │   ├── pipelines
│   │   │   ├── __init__.py
│   │   │   ├── data_processing
│   │   │   │   ├── __init__.py
│   │   │   │   ├── nodes.py
│   │   │   │   └── pipeline.yaml  # <--------------
  2. I 💯% agree with @datajoely, this should start its life as a Kedro plugin. I'll drop some ideas in https://github.com/kedro-org/kedro-plugins/issues/713

noklam commented 3 weeks ago

Had I been able to take a couple of the larger more complex pipelines out and moved them into a config file I think it would've been much easier to work with.

I don't think I quite get this part. Wouldn't breaking pipeline_registry.py into smaller files and importing them from pipeline_registry.py solve this problem already? I am curious to see what the pipeline_registry.py looks like, as I haven't seen a crazily complex registry before. Or do you mean pipeline.py instead?

In my mind, the py -> YAML change is mostly a 1:1 translation from a dictionary to YAML and does not simplify anything. It may make editing easier(?), but it is worse for reading (you lose all the IDE features that can bring you to the definition/references).

Pipeline slicing

This is more interesting to me, and I think it does have some value. Slicing is already supported by the Kedro Pipeline API, but the CLI is limited to certain expressions.
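
For context, the existing CLI slicing surface looks roughly like this; flag names follow recent Kedro versions, and exact availability may vary:

```shell
# Slice a registered pipeline from the CLI (illustrative node/tag names):
kedro run --pipeline=data_processing --from-nodes=split_data_node
kedro run --pipeline=data_processing --to-outputs=model_input_table
kedro run --tags=preprocessing
```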