kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Rectify "modular pipelines" terminology #2723

Open astrojuanlu opened 1 year ago

astrojuanlu commented 1 year ago

Description

We're making various distinctions in our documentation about "Pipelines" and "Modular pipelines", for example in the TOC:

[screenshot: documentation table of contents, with "Pipelines" and "Modular pipelines" as separate entries]

And in our wording:

In many typical Kedro projects, a single (“main”) pipeline increases in complexity as the project evolves. To keep your project fit for purpose, we recommend that you create modular pipelines, which are logically isolated and can be reused.

This wrapper really unlocks the power of modular pipelines.

from kedro.pipeline.modular_pipeline import pipeline

To the point that I believed namespaces were the same as modular pipelines.

However, it turns out that Pipelines and Modular Pipelines are mostly the same thing, and that kedro.pipeline.modular_pipeline.pipeline is not a wrapper over kedro.pipeline.pipeline: they're the same function.

https://github.com/kedro-org/kedro/blob/160fd6bc66d4d780652dd77499abeb23e13f1c09/kedro/pipeline/__init__.py#L5
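For reference, the linked line is just a re-export. Abridged, and assuming the file looks roughly the same in current versions:

```python
# kedro/pipeline/__init__.py (abridged)
# This single line is why `from kedro.pipeline import pipeline` and
# `from kedro.pipeline.modular_pipeline import pipeline` hand you the
# very same function object.
from .modular_pipeline import pipeline
```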

This is also related to this comment, which I didn't fully understand back then: https://github.com/kedro-org/kedro/issues/2402#issuecomment-1460087177

Context

It's a key concept for reusability that many users rely on.

Possible Implementation

Possible Alternatives

There may be less disruptive paths, but I can't think of alternative ways of rectifying the current terminology.

astrojuanlu commented 1 year ago

Maybe ❓ Move pipeline from modular_pipeline.py to pipeline.py and delete modular_pipeline.py. This would break any imports of the form "from kedro.pipeline.modular_pipeline import pipeline", but not "from kedro.pipeline import pipeline".
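Spelled out with the two user-facing import styles (a sketch of the consequences, assuming the move above):

```python
# Would keep working after the proposed move:
from kedro.pipeline import pipeline

# Would raise ImportError once modular_pipeline.py is deleted:
from kedro.pipeline.modular_pipeline import pipeline
```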

From https://github.com/kedro-org/kedro/pull/1147

stichbury commented 1 year ago

We had extensive discussions about how to refer to pipelines and did some user research. I've looked for the notes, but because it was a couple of years ago and I think they were on the internal GitHub repo, I cannot find them. @yetudada and @idanov may have them, or @merelcht. But I think we should revisit the discussion, given that you've found the usage misleading as it currently stands.

astrojuanlu commented 1 year ago

I'm happy to have a look at those notes, but regardless, I think this terminology is unnecessarily complicated as it stands today. It gives the impression that there are 3 kinds of pipelines:

  1. "pipelines"
  2. "modular pipelines"
  3. "registered pipelines"

When in fact, there's only one ("pipelines", which under the hood in Kedro are built with the modular_pipeline.pipeline helper), some of which happen to be registered (with pipeline autodiscovery, all of them in most cases).
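For context on "registered": this is roughly what the default pipeline_registry.py scaffolding looks like in a modern Kedro project (details vary by version), with find_pipelines() doing the autodiscovery mentioned above:

```python
# src/<package_name>/pipeline_registry.py (default scaffolding, roughly)
from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline


def register_pipelines() -> dict[str, Pipeline]:
    # Autodiscover every create_pipeline() under <package_name>/pipelines/
    pipelines = find_pipelines()
    # "__default__" is what `kedro run` executes when no --pipeline is given
    pipelines["__default__"] = sum(pipelines.values())
    return pipelines
```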

Maybe let's chat about this next week.

noklam commented 11 months ago

I would suggest reviewing the modular pipeline as a whole.

  1. I had a long discussion on Slack with one of our users. The docs are confusing and I struggled to understand them.

The example also uses a new pipeline with a cooking analogy, which is nice, but the problem is that this pipeline does not exist anywhere. This is an advanced feature and one of the more complicated ones; playing with the pipeline and seeing it in Kedro-Viz helps a lot in understanding it.

https://docs.kedro.org/en/stable/nodes_and_pipelines/modular_pipelines.html#how-to-use-a-modular-pipeline-with-different-parameters

  2. Many users have been using tags over namespaces, and currently a namespace is basically just a dataset prefix (see the sketch below). People prefer a flat structure over many hierarchies; for example, #2756 is making this change for pipeline creation. On the other hand, keeping the structure makes the pipeline more isolated and easier to work with for micro-packaging, but I think this is less important. We also need to think about how this will work for universal deployment: what's the best way to organise pipelines easily and translate (compile) a Kedro DAG to another tool?
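A minimal sketch of the "namespace is basically just a dataset prefix" point; the dataset and namespace names here are made up:

```python
from kedro.pipeline import node, pipeline

base = pipeline(
    [node(lambda df: df, inputs="model_input", outputs="predictions", name="predict")]
)

# Re-wrapping with a namespace prefixes datasets and node names:
#   "model_input" -> "candidates.model_input"
#   "predictions" -> "candidates.predictions"
#   node "predict" -> "candidates.predict"
namespaced = pipeline(base, namespace="candidates")
```
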
MatthiasRoels commented 11 months ago

I agree with @noklam here that we should review the modular pipeline as a whole. For smaller pipelines and projects (where there are fewer pipelines in general), there is no actual issue other than the confusing terminology.

But for projects with lots of pipelines (and pipelines with lots of nodes), I think there is room for improvement in the concept of a Kedro pipeline itself. In my view, there are 3 points of view to take into account when designing a solution:

  1. deployment: essentially, a pipeline is just a collection of nodes whose inputs/outputs determine a graph structure (which is exactly how Kedro implements the concept!). However, translating a Kedro node into a step to be executed by an orchestration tool (Airflow, Argo Workflows, ...) leads to a lot of compute overhead. Imagine, for example, a situation where you have many fast-running nodes that need to be scheduled on a k8s cluster. For each of these nodes, a pod needs to start (the container image needs to be downloaded on the node; the pod needs to start, run, finish, and communicate its status to the orchestrator). For many nodes, this overhead means the pipeline is not executed as efficiently as possible. Hence the optimal case is either to create "bigger" nodes (combining logic from many nodes into one node) or to run a collection of nodes in one orchestration task. The latter hints towards something like running a sub-pipeline (or whatever). In any case, the optimal scenario is one where you are able to map a collection of nodes onto one step in an orchestration tool (however that would work).
  2. pipeline "discovery": Kedro-Viz is the best tool for the job here! But for big pipelines it might be helpful to see additional structure; collapsing nodes of the same namespace is very helpful, but so is having a view of how deployment works out, i.e. which nodes are mapped onto the same step in the orchestrator.
  3. development: to bring additional structure to big pipelines, it is useful to create sub-pipelines to re-use bigger chunks of work. This is more or less what a modular pipeline wants to achieve, but the need to introduce namespaces makes it quite complex to use, I guess?

Anyway, these are just my thoughts on the topic.

astrojuanlu commented 11 months ago

Thanks @MatthiasRoels for the write-up! About (1), indeed @noklam has some thoughts on this; the granularity issue when deploying Kedro projects is something we want to look into (we have another issue about it, but I don't remember which one it is). For (2), I've seen what Kedro-Viz looks like for huge projects and it indeed needs more work. And (3), what do you mean by sub-pipelines without namespaces?

MatthiasRoels commented 11 months ago

Cool, I am curious about @noklam's thoughts on this!

(3) what do you mean by sub-pipelines without namespaces?

This is not what I meant. What I wanted to say was that the concept of namespaces might be complex for some users when you just want to make a subset of nodes re-usable as a whole. But I might be wrong on this too!

astrojuanlu commented 9 months ago

For the record (because I keep losing this link): the issue in the private repository that collected research around terminology is https://github.com/quantumblacklabs/private-kedro/issues/806

noklam commented 9 months ago

I need to get better at GitHub notifications; I only saw this in an email today 😅

(3) what do you mean by sub-pipelines without namespaces?

Currently the namespace is mainly used for two purposes:

  1. Kedro-Viz: the ability to filter and collapse pipelines
  2. To avoid name conflicts: you cannot have two datasets with identical names, so you apply a namespace to add a prefix

I guess this is what you mean by using sub-pipelines without namespaces?

noklam commented 9 months ago

deployment: essentially, a pipeline is just a collection of nodes whose inputs/outputs determine a graph structure (which is exactly how Kedro implements the concept!). However, translating a Kedro node into a step to be executed by an orchestration tool (Airflow, Argo Workflows, ...) leads to a lot of compute overhead. Imagine, for example, a situation where you have many fast-running nodes that need to be scheduled on a k8s cluster. For each of these nodes, a pod needs to start (the container image needs to be downloaded on the node; the pod needs to start, run, finish, and communicate its status to the orchestrator). For many nodes, this overhead means the pipeline is not executed as efficiently as possible. Hence the optimal case is either to create "bigger" nodes (combining logic from many nodes into one node) or to run a collection of nodes in one orchestration task. The latter hints towards something like running a sub-pipeline (or whatever). In any case, the optimal scenario is one where you are able to map a collection of nodes onto one step in an orchestration tool (however that would work).

IMO, we need to clarify what should be done on the Kedro side versus in a Kedro plugin. Kedro shouldn't map to a specific orchestrator; that should be a plugin's job. The idea of collapsing a modular pipeline/sub-pipeline into an orchestrator node could potentially be done by Kedro. Here is an old idea that was proposed.

The 1-1 node mapping is a topic that comes up repeatedly, and at this point I think we can agree it is bad in most cases. The logical first step is 1 pipeline = 1 node; of course it varies a lot per deployment, and it also depends on how you structure your pipeline and how granular it is.

The serialisation/deserialisation cost goes up with the number of nodes, so reducing the number of nodes should be the first thing to do. Some take the approach of serialising the intermediate data to S3 (or equivalent) for cross-node communication. https://pypi.org/project/vineyard-kedro/ takes this to the next level and optimises it for K8s.

The challenge here for Kedro is that, in a single Kedro run, the KedroSession orchestrates the whole run, but in deployment the pieces run separately. So this orchestration step needs to happen before they are sent to the orchestrator. Essentially, when you collapse a pipeline into a node, you want everything to become in-memory and to persist only the data that is necessary for communication with other orchestrator nodes.
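A rough sketch of the boundary computation this implies — the helper is hypothetical, not existing Kedro code, but Pipeline.inputs() and Pipeline.all_outputs() are public API:

```python
def boundary_datasets(pipelines: dict) -> set[str]:
    """Datasets produced by one pipeline and consumed by another.

    If each pipeline is collapsed into a single orchestrator node, only
    these datasets must be persisted; everything else can stay in memory
    inside its node.
    """
    boundary = set()
    for name, pipe in pipelines.items():
        for other_name, other in pipelines.items():
            if name != other_name:
                boundary |= pipe.all_outputs() & other.inputs()
    return boundary
```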

MatthiasRoels commented 9 months ago

I guess this is what you mean by using sub-pipelines without namespaces?

That’s exactly what I meant!

MatthiasRoels commented 9 months ago

Kedro shouldn't map to a specific orchestrator; that should be a plugin's job.

Absolutely agree! But on the Kedro side, some prep work can definitely be done that can be re-used in different plugins.

The serialisation/deserialisation cost goes up with the number of nodes, so reducing the number of nodes should be the first thing to do. Some take the approach of serialising the intermediate data to S3 (or equivalent) for cross-node communication. https://pypi.org/project/vineyard-kedro/ takes this to the next level and optimises it for K8s.

Assuming you're talking about orchestrator nodes, that's exactly what you want to do. IMO, an object store (S3, GCS, MinIO, ...) should work fine for the majority of use cases!

The challenge here for Kedro is that, in a single Kedro run, the KedroSession orchestrates the whole run, but in deployment the pieces run separately. So this orchestration step needs to happen before they are sent to the orchestrator. Essentially, when you collapse a pipeline into a node, you want everything to become in-memory and to persist only the data that is necessary for communication with other orchestrator nodes.

That's not necessarily true. You need to persist at least all datasets required in other orchestration nodes, but that doesn't mean you don't need to persist other datasets! I would imagine some sort of kedro compile method/CLI where you construct the required data to be used by a plugin to create the required orchestrator resource (e.g. an Airflow DAG). In that CLI/method, you can then do the required checks to validate that at least the datasets required for inter-node communication are persisted.
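A hedged sketch of that validation step; the check itself is hypothetical (there is no kedro compile command today), but DataCatalog.list() is public API:

```python
from kedro.io import DataCatalog


def validate_persisted(catalog: DataCatalog, boundary: set[str]) -> None:
    """Fail "compilation" if a dataset that crosses an orchestrator-node
    boundary has no catalog entry (and would default to a MemoryDataset)."""
    declared = set(catalog.list())
    missing = boundary - declared
    if missing:
        raise ValueError(
            f"Datasets used for inter-node communication are not persisted: "
            f"{sorted(missing)}"
        )
```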

noklam commented 9 months ago

I would imagine some sort of kedro compile method/CLI where you construct the required data to be used by a plugin to create the required orchestrator resource (e.g. an Airflow DAG).

I've always wanted to specify which data to persist (or keep in memory) at runtime without touching the catalog; that's for interactive workflows.

That's not necessarily true. You need to persist at least all datasets required in other orchestration nodes, but that doesn't mean you don't need to persist other datasets! I would imagine some sort of kedro compile method/CLI where you construct the required data to be used by a plugin to create the required orchestrator resource (e.g. an Airflow DAG). In that CLI/method, you can then do the required checks to validate that at least the datasets required for inter-node communication are persisted.

True, I focused on the minimal data that is required; of course, in practice you will want to customise. This is consistent with defaulting to 1 pipeline = 1 orchestrator node, where you may want to further collapse pipelines or may need more granularity. So this should be the default if no config is given.

MatthiasRoels commented 9 months ago

A bit of a braindump here, but take an easy example where I have a Kedro project consisting of 2 pipelines, A and B (and obviously a __default__, which is the sum of the two). If I then, at least conceptually, think about the process of creating a deployment for these two pipelines, the first step is to figure out the order in which to run them. There are three options:

  1. A first, then B
  2. B first, then A
  3. A and B in parallel

Actually, there is a fourth option, but that one should result in a "compile" error: the scenario where A depends on dataset_1 and produces dataset_2, whereas B depends on dataset_2 and produces dataset_1 (this looks like an artificial scenario, but believe me, if your Kedro pipelines are big enough, this can happen).

So Kedro core (not a plugin) needs to figure out the correct order of execution, as well as the exact kedro command required to run pipeline A resp. B. I think with that info, you can then create specific plugins to generate the target deployment; in my specific case, that would be a k8s resource for Argo Workflows. I'm even imagining starting from either a predefined template or a custom one provided by the user.
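A sketch of that ordering step under the assumptions above (the dict-of-pipelines shape is illustrative; graphlib is in the standard library from Python 3.9):

```python
from graphlib import TopologicalSorter


def pipeline_order(pipelines: dict) -> list[str]:
    """Derive a valid execution order between whole pipelines from their
    dataset-level dependencies (independent pipelines may run in parallel;
    this just returns one valid linearisation)."""
    deps = {
        name: {
            other_name
            for other_name, other in pipelines.items()
            if other_name != name and pipe.inputs() & other.all_outputs()
        }
        for name, pipe in pipelines.items()
    }
    # The circular A <-> B scenario above surfaces here as a
    # graphlib.CycleError, i.e. the "compile" error.
    return list(TopologicalSorter(deps).static_order())
```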

I see two potential starting points:

  1. we start simple and ask the user to provide a list of Kedro pipelines to orchestrate, so that we can focus on implementing what I discussed above, as well as some plugins
  2. we focus on how we can split a particular pipeline into parts that we want to use for orchestration (either by tag, namespace, ...). I'm also wondering if we can somehow automatically create that split when we sum two pipelines; wouldn't it be cool if we could collapse A + B into "super-nodes" A and B? This way, the user can just specify one pipeline (__default__) and Kedro automatically figures out the different parts to orchestrate.

astrojuanlu commented 8 months ago

This conversation has branched off quite a bit, so I'll try to center the main question again:

Can somebody explain to me, like I'm 5 years old, what makes a "modular pipeline" different from a "pipeline"?

astrojuanlu commented 8 months ago

And more:

So, if I'm correct, "a pipeline" and "a modular pipeline", depending on context, might be two entirely different categories of things: the former a Python class, the latter a directory structure. Furthermore: a modular pipeline contains a pipeline (kedro.pipeline.pipeline.Pipeline) definition.

And this is where this terminology, in my opinion, falls apart: a "modular pipeline" is not a kedro.pipeline.pipeline.Pipeline "gone modular"; it's a wrapper (in the form of a bunch of Python modules with a specific structure) around a kedro.pipeline.pipeline.Pipeline. There is no IS-A (inheritance) relationship between "modular pipeline" and "pipeline", but rather a HAS-A (composition) relationship. A "modular pipeline" is not a pipeline, and it's not even a module, because it's a package (a bunch of modules).
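To make the HAS-A point concrete, this is the conventional layout that kedro pipeline create scaffolds; the package (the "modular pipeline") merely contains a module whose factory returns a plain Pipeline:

```python
# src/my_project/pipelines/data_processing/pipeline.py
# One module inside the "modular pipeline" package: the package HAS a
# Pipeline; it IS not one.
from kedro.pipeline import Pipeline, node, pipeline


def _clean(raw):
    return raw


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([node(_clean, inputs="raw", outputs="clean", name="clean")])
```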

Not that I have better ideas now (and also I don't want to boil the ocean), but I wanted to at least give my interpretation.

astrojuanlu commented 7 months ago

A bit more insight on modular pipelines https://github.com/quantumblacklabs/private-kedro/issues/752#issuecomment-736680109 (private link)

(@idanov if you consent, you could copy-paste that comment here)

stichbury commented 5 months ago

I'm removing the documentation label from this, as we have a docs task (#1998) to cover improvement of the docs about modular pipelines. This ticket (to my mind) covers the philosophy of how we talk about modular pipelines and the language we want to use in communicating with users. It needs to happen ahead of the docs work; then, when all is agreed, the docs can be overhauled. So #1998 is dependent on this (a "child", if you like), but this isn't a docs ticket.

astrojuanlu commented 1 week ago

After we merge #3948, I think the only things left are doing one last pass on the Kedro Frameworks docs and reviewing the Kedro-Viz ones.

As far as I understand (after 1 year of chewing on this issue), Kedro-Viz mostly cares about 2 things:

Since Kedro-Viz doesn't really have a user guide, there is not much to review. The word "modular" appears exactly once in the docs:

https://github.com/kedro-org/kedro-viz/blob/1d14055f5a75ba32e6db37f3bb8a24aec71986b8/docs/source/index.md?plain=1#L17

The codebase is another thing, though. https://github.com/kedro-org/kedro-viz/pull/1941 refers to "modular pipelines", and so do all the Python classes, but it's actually talking about namespaces. I reckon that doing a search & replace might have big, unintended consequences (cc @rashidakanchwala), so it's probably not worth the effort, but at least the user-facing documentation should make the concepts crystal clear.

astrojuanlu commented 1 week ago

And change our tutorial too: https://docs.kedro.org/en/stable/tutorial/add_another_pipeline.html#modular-pipelines

astrojuanlu commented 1 week ago

So, long story short:

astrojuanlu commented 1 week ago

Moving this back to our Inbox so we can re-prioritise.