kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Enhance Kedro Namespaces adoption #4343

Open DimedS opened 23 hours ago

DimedS commented 23 hours ago

Kedro namespaces are currently not widely used. The team is divided on the reasons for this:

  1. Local issues such as incomplete docs, unresolved technical challenges, and potential user concerns about the interface.
  2. The namespace feature has been primarily suited for pipeline reusability. However, due to its complexity and lack of successful adoption over the past five years, it may require a significant redesign.

This parent issue aims to facilitate an agreed-upon decision regarding the points above and address these concerns. It is also tied to the goal of improving deployment functionality, where namespaces should play a pivotal role in node grouping.

History


Improving docs


Technical issues

Several technical issues were highlighted by @idanov during the last TD. These will be moved here for tracking (details in progress).


User interface

There is a potential user interface concern affecting namespace adoption, which might benefit from design attention (@stephkaiser, @iamelijahko).

Tags are applied directly to nodes, whereas namespaces require changes at the pipeline level. Simplifying the namespace UI or aligning it more closely with tagging might also improve adoption.

A few other UI gaps reported by users:

Namespaces in deployment

We aim to unify and implement node grouping functionality for deployment purposes in #4319. Namespaces appear to be a great fit for this purpose. However, the ongoing work in this ticket to increase namespace adoption must be completed at the same time.

datajoely commented 23 hours ago

Adding a bit of context - deep integration with Kedro-Viz was the first attempt to drive adoption and improve explainability:

We have spent nearly 5 years trying to explain this to users in various ways. We must pivot strategy.

astrojuanlu commented 8 hours ago

Thanks @DimedS for opening this issue.

First, I would like to agree that tags do not guarantee non-overlapping pipeline partitioning. This has been said time and time again.
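To make that point concrete, here is a minimal sketch in plain Python. It does not use the Kedro API; the node names, tag names, and helper functions are all hypothetical and only illustrate the idea. Grouping by tags can place one node in several groups, while grouping by a single label per node always yields non-overlapping groups.

```python
from collections import Counter, defaultdict

# Hypothetical nodes and labels, for illustration only (not Kedro objects).
NODE_TAGS = {
    "clean_data": {"preprocessing"},
    "build_features": {"preprocessing", "training"},  # one node, two tags
    "train_model": {"training"},
}

# Same nodes, but each node carries exactly one grouping label.
NODE_GROUP = {
    "clean_data": "preprocessing",
    "build_features": "preprocessing",
    "train_model": "training",
}


def groups_from_tags(node_tags):
    """Group nodes by tag; a node with several tags lands in several groups."""
    groups = defaultdict(set)
    for node, tags in node_tags.items():
        for tag in tags:
            groups[tag].add(node)
    return dict(groups)


def groups_from_single_label(node_group):
    """Group nodes by their single label; each node lands in one group."""
    groups = defaultdict(set)
    for node, label in node_group.items():
        groups[label].add(node)
    return dict(groups)


def is_partition(groups, all_nodes):
    """Return True if every node belongs to exactly one group."""
    counts = Counter(node for members in groups.values() for node in members)
    return all(counts[node] == 1 for node in all_nodes)


tag_groups = groups_from_tags(NODE_TAGS)
label_groups = groups_from_single_label(NODE_GROUP)

print(is_partition(tag_groups, NODE_TAGS))     # overlapping: build_features is in two groups
print(is_partition(label_groups, NODE_GROUP))  # one label per node, so no overlap
```

Nothing here says which mechanism should supply the single label; it only shows why a many-to-many tag mapping cannot by itself guarantee a partition.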

But I am going to push back against the idea that namespaces are the right solution for that problem. The main reason is that they were probably never designed to solve it in the first place!

Namespaces were born as "prefixes" and were introduced in Kedro 0.15.4 in October 2019:

3c0f097991119cce5f42de8844686de104604bf4

(https://github.com/McK-Private/private-kedro/pull/286, private link)

Then, in March 2020, the modern concept of "modular pipelines" with namespaces was introduced in 0.16.0:

af046ca6c738a89e19d6e31ab432a13b0b184190

Therefore a bit less or a bit more than 5 years have passed, depending on how you look at it.

The original context and discussion have been lost to time (https://jira.quantumblack.com/browse/KED-1105, broken internal link), but we can get a glimpse of what the intent of the feature was from this comment:

Nikos pointed me to this and having thought a bunch about vertical pipeline development and pipeline re-use, recently

(https://github.com/McK-Private/private-kedro/pull/286#issuecomment-542717548, private link)

In addition, this is what the documentation of prefixes, and later namespaces, looked like:

The docs have always described namespaces (prefixes) as a way to reuse pipelines. There were zero review comments in those two PRs raising concerns about that.

To note, nobody from the current team participated in the original 0.15 discussion.

Therefore, I can only conclude that namespaces were always designed with pipeline reuse in mind.

Implying that namespaces have always been the solution for non-overlapping pipeline partitioning is, in my view, a big unqualified opinion that has no backing in historical written evidence. And as such, saying that "the docs are wrong" is a misrepresentation of what those docs were supposed to describe.

If anything, we're now retrofitting namespaces to solve a problem they weren't intended to solve in the first place.


I am going to push back against making incremental improvements to a feature that nobody has dared to touch in 5 years, that's difficult to understand even for Kedro engineers, let alone for our users (regardless of their intended use case), and that we're probably retrofitting to solve a problem it wasn't designed for.

My recommendation is that we look at the problem of non-overlapping pipeline partitioning with fresh eyes, go back to the drawing board, and prototype.

datajoely commented 8 hours ago

I would also say from users

DimedS commented 5 hours ago

Thank you for your comments, @datajoely and @astrojuanlu. I see that there isn’t a consensus within the team about the future of namespaces, so I’ve updated the header of this issue to reflect your perspectives.

I propose that we continue the discussion about deployment node grouping in the next Tech Design meeting with an open mind to all grouping possibilities - not limited to namespaces. If, during that discussion, we determine that namespaces are essential for deployment, we can revisit this conversation and make a decision on their future.

datajoely commented 3 hours ago

Great - I'll also link to this write-up from last year: https://github.com/kedro-org/kedro/wiki/Synthesis-of-research-related-to-deployment-of-Kedro-to-modern-MLOps-platforms