kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.88k stars 893 forks source link

Import `TRANSCODING_SEPARATOR` to pipeline to make Kedro backwards compatible with `kedro-viz` on python 3.8 #3822

Closed ankatiyar closed 5 months ago

ankatiyar commented 5 months ago

Description

Flagged by e2e test on kedro-docker this morning (but not related to kedro-docker) - https://github.com/kedro-org/kedro-plugins/actions/runs/8730295535/job/23953858295#step:8:1

Context

The problem

For python versions > 3.8, everything is okay ✅

For python version == 3.8: If a user has a project that uses both Kedro and Viz, the project becomes completely unusable -

Development notes

Proposed Solution 1

Proposed Solution 2

Other considerations (Bigger discussion)

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

deepyaman commented 5 months ago

Proposed Solution 2

  • Add an import statement for TRANSCODE_SEPARATOR to kedro.pipeline.pipeline
  • Do a patch release to make Kedro backwards compatible with kedro-viz

How about this, but just get rid of the _transcoding module and put the stuff back into pipeline. TBH I think adding the _transcoding module wasn't really necessary, and it introduced a new convention that doesn't exist anywhere else in the codebase (modules beginning with _).

But generally in favor of doing a patch release; easy fix that will avoid more questions, and show responsiveness.

deepyaman commented 5 months ago

Proposed Solution 2

  • Add an import statement for TRANSCODE_SEPARATOR to kedro.pipeline.pipeline
  • Do a patch release to make Kedro backwards compatible with kedro-viz

How about this, but just get rid of the _transcoding module and put the stuff back into pipeline. TBH I think adding the _transcoding module wasn't really necessary, and it introduced a new convention that doesn't exist anywhere else in the codebase (modules beginning with _).

But generally in favor of doing a patch release; easy fix that will avoid more questions, and show responsiveness.

To add some context to my suggestion—what was the original reason for moving stuff to the _transcoding module? The only somewhat functional thing I can see this does is make TRANSCODING_SEPARATOR not public (not totally sure why this is done; I have leveraged it in plugin code previously). However, if it was intentional to make this "private", then just importing it in pipeline undoes this, too.

noklam commented 5 months ago

we recently moved TRANSCODING_SEPARATOR to kedro.pipeline._transcoding.py from kedro.pipeline.pipeline.py

This is a breaking change and shouldn't be done in non-breaking release, we can do this but add a pointer to import from the new module instead. Agree with @deepyaman as I am not sure why this refactoring is necessary. It's not a big problem so a patch fix will do.

merelcht commented 5 months ago

To add some context to my suggestion—what was the original reason for moving stuff to the _transcoding module? The only somewhat functional thing I can see this does is make TRANSCODING_SEPARATOR not public (not totally sure why this is done; I have leveraged it in plugin code previously). However, if it was intentional to make this "private", then just importing it in pipeline undoes this, too.

It was done as part of https://github.com/kedro-org/kedro/pull/3812 to fix the issue with node dependencies uncovered by using the built-in toposort functionality. When reviewing it I didn't realise the transcoding separator was used outside of the framework and didn't think it was such a consequential refactor 😕

ElenaKhaustova commented 5 months ago

We missed that when we applied #3812 😔

Given the fact that the starters' tests will keep failing until we fix that, I would go with option 2 if it's relatively fast.

deepyaman commented 5 months ago

Approved with a small comment to explains why this is needed and remind ourselves this can be removed in 0.20

I think this actually shouldn't be removed in 0.20? By adding this, we're once again signaling that TRANSCODING_SEPARATOR is a public attribute that people can depend upon. Furthermore, my opinion is, TRANSCODING_SEPARATOR is something that should be exposed (it's not an internal-only implementation detail, and plugins use it, e.g. for catalog parsing).

idanov commented 5 months ago

@deepyaman The addition of the module is actually necessary, because it resolves a circular dependency between node.py and pipeline.py. So not really just random move, but a necessity. The convention of private modules is not unheard of, that's the same general convention for private elements in Python, but applied for modules, which is simply not as common, but still part of the convention.

deepyaman commented 5 months ago

@deepyaman The addition of the module is actually necessary, because it resolves a circular dependency between node.py and pipeline.py. So not really just random move, but a necessity.

There are other ways to remove circular dependencies, like not importing at the top level. It's not necessarily required. But that's fine.

The convention of private modules is not unheard of, that's the same general convention for private elements in Python, but applied for modules, which is simply not as common, but still part of the convention.

Yes, there are a lot of libraries that use it very heavily. My point is that Kedro does not use it anywhere—I searched before making that comment—so it's introducing a new convention in the codebase. Even naming it transcoding instead of _transcoding would be more consistent IMO.

deepyaman commented 5 months ago

@deepyaman The addition of the module is actually necessary, because it resolves a circular dependency between node.py and pipeline.py. So not really just random move, but a necessity.

There are other ways to remove circular dependencies, like not importing at the top level. It's not necessarily required.

Created a simple PR to illustrate: https://github.com/kedro-org/kedro/pull/3826