kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Drop dependency on toposort in favour of built-in graphlib #3728

Closed idanov closed 4 months ago

idanov commented 4 months ago

Description

Since Python 3.9, the standard library has provided topological sorting through the graphlib module, and a backport is available for Python 3.8. This PR drops the dependency on the third-party toposort library in favour of the built-in solution, reducing Kedro's dependency footprint. The change appears to be fully backwards compatible, as no tests failed with it.
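For reviewers unfamiliar with graphlib, here is a minimal sketch of how its TopologicalSorter can produce groups of mutually independent nodes, similar in spirit to what a toposort-style helper returns. This is an illustration only, not the code in this PR: the function name toposort_groups and the example node names are hypothetical, and sorting each group is an assumption added here to make the output stable.

```python
from graphlib import CycleError, TopologicalSorter


def toposort_groups(dependencies):
    """Return lists of nodes; each list depends only on earlier lists.

    `dependencies` maps each node to the set of nodes it depends on.
    Hypothetical helper for illustration, not part of Kedro's API.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    groups = []
    while sorter.is_active():
        ready = sorter.get_ready()    # nodes whose dependencies are all done
        groups.append(sorted(ready))  # sort so the group order is stable
        sorter.done(*ready)           # unblock nodes that depend on this group
    return groups


# Hypothetical pipeline: two cleaning steps feed a join, which feeds two consumers.
example = {
    "clean_a": set(),
    "clean_b": set(),
    "join": {"clean_a", "clean_b"},
    "train": {"join"},
    "report": {"join"},
}
groups = toposort_groups(example)
```

With this input, groups is three lists: the two cleaning steps, then the join, then the two consumers. A circular dependency surfaces as graphlib.CycleError from prepare().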

Development notes

Current test coverage is sufficient.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.


datajoely commented 4 months ago

Amazing!

idanov commented 4 months ago

(screenshot) I tested it and the runs are not deterministic; the original problem was caused somewhere by a set operation.

See: #1604

Could you explain a bit further? I don't think I see non-deterministic task ordering in your screenshot. All runs seem to execute the tasks in the same order; the only non-deterministic thing here is the order in which the datasets are loaded, which isn't the subject of this PR.

noklam commented 4 months ago

@idanov I mistook the loading order for the task order. I can confirm the task order is deterministic. The non-deterministic loading order is unrelated and was not introduced by this PR.
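To illustrate the kind of set-related effect mentioned above: Python randomizes string hashes per process by default, so iterating over a set of dataset names can produce a different order on each run. The sketch below is hypothetical (load_order and the dataset names are made up for illustration and are not Kedro code); it shows how sorting the names before iterating would give a stable order.

```python
def load_order(input_names):
    """Return dataset names in a stable order.

    Hypothetical helper: iterating directly over a set of strings may vary
    between interpreter runs because of hash randomization, whereas a sorted
    list is always the same.
    """
    return sorted(set(input_names))


# Hypothetical dataset names, passed as a set on purpose.
order = load_order({"shuttles", "companies", "reviews"})
```

Here order is always companies, reviews, shuttles, regardless of how the set happens to iterate in a given process.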