Enhance Kedro Namespaces adoption

DimedS commented 23 hours ago

Kedro namespaces are currently not widely used. The team is divided on the reasons for this:

Local issues such as incomplete docs, unresolved technical challenges, and potential user concerns about the interface.
The namespace feature has been primarily suited for pipeline reusability. However, due to its complexity and lack of successful adoption over the past five years, it may require a significant redesign.

This parent issue aims to facilitate an agreed-upon decision regarding the points above and address these concerns. It is also tied to the goal of improving deployment functionality, where namespaces should play a pivotal role in node grouping.

History

Explanation of namespaces by @idanov, 2020
2023-11-29 Namespaces TD, Part 1
2023-12-06 Namespaces TD, Part 2
[2023-11-20 Namespaces TD, Part 3](the link is in progress)
Changing modular pipelines terminology GitHub thread

Improving docs

[ ] The current documentation focuses primarily on how namespaces enhance pipeline reusability (see docs). However, this ticket proposes updating the docs to include a clear definition of namespaces, highlighting that they are similar to node tagging but do not allow overlaps. This makes namespaces an excellent choice for creating groups of nodes that can be executed together without conflicts. Suggested docs example:
-Create pipelines without namespaces: Show how to build basic pipelines.
-Create namespaced pipelines: Use the initial pipelines to create namespaced versions.
-Combine pipelines: Build a final pipeline by combining the namespaced ones.
-Visualise: Include a visualisation using Kedro-Viz (link to ticket in progress).
[ ] #4016
[ ] Clarifying Modularity. The term "modularity" currently appears to relate to creating pipelines in separate folders, not namespaces. If this interpretation is correct, we should explicitly clarify this distinction in the docs.

Technical issues

Several technical issues were highlighted by @idanov during the last TD. These will be moved here for tracking (details in progress).

User interface

There is a potential user interface concern affecting namespace adoption, which might benefit from design attention (@stephkaiser, @iamelijahko).

Namespaces are conceptually similar to tags (but without node overlaps), yet tagging adoption is strong, especially for deployment purposes.
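To make the "without node overlaps" distinction concrete, here is a minimal plain-Python sketch. It does not use Kedro: the node records, tag names, and namespace names are hypothetical, and the only assumption is that a node can carry several tags but sits in at most one namespace.

```python
from collections import defaultdict

# Hypothetical node records: several tags per node, but only one namespace.
nodes = [
    {"name": "load", "tags": {"etl"}, "namespace": "ingest"},
    {"name": "clean", "tags": {"etl", "train"}, "namespace": "ingest"},
    {"name": "fit", "tags": {"train"}, "namespace": "model"},
]


def group_by_tag(nodes):
    """Map each tag to the set of node names carrying it."""
    groups = defaultdict(set)
    for n in nodes:
        for tag in n["tags"]:
            groups[tag].add(n["name"])
    return dict(groups)


def group_by_namespace(nodes):
    """Map each namespace to the set of node names inside it."""
    groups = defaultdict(set)
    for n in nodes:
        groups[n["namespace"]].add(n["name"])
    return dict(groups)


def overlapping(groups):
    """Return node names that appear in more than one group."""
    seen, dupes = set(), set()
    for members in groups.values():
        dupes |= seen & members
        seen |= members
    return dupes


tag_groups = group_by_tag(nodes)
ns_groups = group_by_namespace(nodes)
print(overlapping(tag_groups))  # {'clean'}: the node belongs to two tag groups
print(overlapping(ns_groups))   # set(): namespace groups never share a node
```

In this toy data the tag groups share the `clean` node, while the namespace groups form a clean partition, which is the property that matters for grouping nodes into deployable units.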

The UI for tags and namespaces differs significantly:

Tagging Example: Tags are added directly during node or pipeline creation:

node(func=add, inputs=["a", "b"], outputs="sum", name="adding_a_and_b", tags="node_tag")

Alternatively, for pipelines:

my_pipeline = pipeline([...], tags="pipeline_tag")

Namespace Example: Namespaces are applied at the pipeline creation level and involve multiple steps:

Create a pipeline without namespaces:

part1 = pipeline([
    node(func=split_data, inputs=["a", "b"], outputs=["c", "d"], name="node1"),
    node(func=split_data, inputs=["c", "d"], outputs=["e", "f"], name="node2"),
])

Add a namespace:

part1_ns = pipeline(part1, namespace="part1_ns") # pipeline name most likely repeats namespace name

This prefixes all inputs, outputs, and parameters with part1_ns., which is most likely not desired. To preserve naming:

part1_ns = pipeline(part1, namespace="part1_ns", inputs={"a", "b"}, outputs={"e", "f"})
# I need to specify my inputs and outputs twice
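The renaming described above can be modelled in a few lines of plain Python. This is a simplified sketch of the rule, not Kedro's implementation: the helper `namespaced`, the `keep` argument, and the `params:ratio` entry are hypothetical, and the `params:<namespace>.<name>` form is an assumption about how parameters are prefixed.

```python
def namespaced(name, namespace, keep=frozenset()):
    """Toy model: prefix a dataset or parameter name unless it is explicitly kept."""
    if name in keep:
        # Names listed as explicit inputs/outputs stay unchanged.
        return name
    if name.startswith("params:"):
        # Assumed form: the namespace goes after the "params:" marker.
        return f"params:{namespace}.{name[len('params:'):]}"
    return f"{namespace}.{name}"


datasets = ["a", "b", "c", "d", "e", "f", "params:ratio"]

# Without explicit inputs/outputs, every name gets the prefix.
everything_prefixed = [namespaced(d, "part1_ns") for d in datasets]

# Listing the boundary datasets keeps their original names.
boundaries_kept = [
    namespaced(d, "part1_ns", keep={"a", "b", "e", "f"}) for d in datasets
]

print(everything_prefixed)
print(boundaries_kept)
```

Only the intermediate datasets `c` and `d` (and the parameter) end up prefixed in the second case, which is why the boundary inputs and outputs have to be repeated when the namespace is applied.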

Combine pipelines in the registry:

my_pipeline = part1_ns + part2_ns + ...
# Each of my node groups is actually a pipeline, whereas with tags a group is just a tag on a node

Tags are applied directly to nodes, whereas namespaces require changes at the pipeline level. Simplifying the namespace UI or aligning it more closely with tagging might also improve adoption.

A few other UI gaps reported by users:

[ ] #3448
[ ] #3679

Namespaces in deployment

We aim to unify and implement node grouping functionality for deployment purposes in #4319. Namespaces appear to be a great fit for this purpose. However, the ongoing work in the current ticket to increase namespace adoption must be completed at the same time.
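As a rough illustration of what namespace-based grouping could look like for deployment, here is a plain-Python sketch. It is not the design proposed in #4319 and does not use Kedro: the node tuples, the contents of `part2_ns`, and the helpers `group_of` and `deployment_groups` are all hypothetical, and the group is assumed to be the top-level namespace in the node name.

```python
from collections import defaultdict

# Hypothetical nodes: (namespaced node name, inputs, outputs).
nodes = [
    ("part1_ns.node1", ["a", "b"], ["part1_ns.c", "part1_ns.d"]),
    ("part1_ns.node2", ["part1_ns.c", "part1_ns.d"], ["e", "f"]),
    ("part2_ns.node1", ["e", "f"], ["g"]),
]


def group_of(node_name):
    """Assumed rule: the group is the text before the first dot."""
    return node_name.split(".", 1)[0]


def deployment_groups(nodes):
    """Return (group -> node names, group -> upstream groups it depends on)."""
    members = defaultdict(list)
    producer = {}  # dataset name -> group that produces it
    for name, _inputs, outputs in nodes:
        members[group_of(name)].append(name)
        for out in outputs:
            producer[out] = group_of(name)

    depends_on = defaultdict(set)
    for name, inputs, _outputs in nodes:
        for inp in inputs:
            upstream = producer.get(inp)
            # Only record edges that cross a group boundary.
            if upstream is not None and upstream != group_of(name):
                depends_on[group_of(name)].add(upstream)
    return dict(members), dict(depends_on)


members, depends_on = deployment_groups(nodes)
print(members)
print(depends_on)
```

Because every node belongs to exactly one group, each group could in principle map to one deployment unit (for example a container or an orchestrator task), with the cross-group edges giving the execution order.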

datajoely commented 23 hours ago

Adding a bit of context - deep integration with Kedro-Viz was the first attempt to drive adoption and improve explainability:

@limdauto published these docs in October 2021
The earliest version of Kedro-Viz that included this was tagged for release in Feb 2020

We have spent nearly 5 years trying to explain this to users in various ways. We must pivot strategy.

astrojuanlu commented 8 hours ago

Thanks @DimedS for opening this issue.

First, I would like to agree that tags do not guarantee non-overlapping pipeline partitioning. This has been said time and time again.

But I am going to push back against the idea that namespaces are the right solution for that problem. The main reason is that they were probably never designed to solve it in the first place!

Namespaces were born as "prefixes" and were introduced in Kedro 0.15.4 in October 2019:

3c0f097991119cce5f42de8844686de104604bf4

(https://github.com/McK-Private/private-kedro/pull/286, private link)

And then in 0.16.0 the modern concept of "modular pipelines" with namespace was introduced in March 2020:

af046ca6c738a89e19d6e31ab432a13b0b184190

Therefore a bit less or a bit more than 5 years have passed, depending on how you look at it.

The original context and discussion have forever been lost in time https://jira.quantumblack.com/browse/KED-1105 (broken internal link) but we can get a glimpse of what the intent of the feature was from this comment:

Nikos pointed me to this and having thought a bunch about vertical pipeline development and pipeline re-use, recently

(https://github.com/McK-Private/private-kedro/pull/286#issuecomment-542717548, private link)

In addition, this is what the documentation of prefixes, and later namespaces, looked like:

Prefixes https://docs.kedro.org/en/0.15.4/04_user_guide/06_pipelines.html#using-a-modular-pipeline-twice (https://github.com/McK-Private/private-kedro/pull/286, private link)
Namespaces https://docs.kedro.org/en/0.16.0/04_user_guide/06_pipelines.html#using-a-modular-pipeline-twice (https://github.com/McK-Private/private-kedro/pull/569, private link)

The docs have always described namespaces (prefixes) as a way to reuse pipelines. There were zero review comments in those two PRs raising concerns about that.

To note, nobody from the current team participated in the original 0.15 discussion.

Therefore, I can only conclude that namespaces were always designed with pipeline reuse in mind.

Implying that namespaces have always been the solution for pipeline non-overlapping partitioning is, in my view, a big unqualified opinion that has no backing in historical written evidence. And as such, saying that "the docs are wrong" is a misrepresentation of what those docs were supposed to describe.

If anything, we're now retrofitting namespaces to solve a problem they weren't intended to solve in the first place.

I am going to push back against making incremental improvements to a feature that nobody has dared to touch in 5 years, that's difficult to understand even for Kedro engineers, let alone for our users (regardless of their intended use case), and that we're probably retrofitting to solve a problem it wasn't designed for.

My recommendation is that we look at the problem of non-overlapping pipeline partitioning with fresh eyes, go back to the drawing board, and prototype.

datajoely commented 8 hours ago

I would also say, from the users' perspective:

Pipeline reuse is either a solved or a minimal problem these days
Deployment is the much more acute sore spot. Dependency isolation and container granularity both fall into this space, and we're not doing a good job of either.

DimedS commented 5 hours ago

Thank you for your comments, @datajoely and @astrojuanlu. I see that there isn’t a consensus within the team about the future of namespaces, so I’ve updated the header of this issue to reflect your perspectives.

I propose that we continue the discussion about deployment node grouping in the next Tech Design meeting with an open mind to all grouping possibilities - not limited to namespaces. If, during that discussion, we determine that namespaces are essential for deployment, we can revisit this conversation and make a decision on their future.

datajoely commented 3 hours ago

Great - I'll also link to this write up from last year: https://github.com/kedro-org/kedro/wiki/Synthesis-of-research-related-to-deployment-of-Kedro-to-modern-MLOps-platforms
