Closed idanov closed 6 months ago
Awesome job, huge improvement 👏🏻
As for:
potentially making Node and Pipeline use attrs. The latter will help ensuring that they remain immutable, as apparently a previous contribution snuck-in mutability to the Node class which is against the idea of stateless nodes
That change would be really unfortunate, because the flow of having a hook that changes the node.func
at runtime is a common pattern I've seen (and also used / recommended) multiple times.
Examples:
This is a side conversation, not related to the PR, but responding to:
That change would be really unfortunate, because the flow of having a hook that changes the
node.func
at runtime is a common pattern I've seen (and also used / recommended) multiple times.
The immutability change will be a breaking change unfortunately, so unlikely to happen soon. Nevertheless we can make them attrs
objects even without making them fully immutable and with no breaking changes.
The introduction of mutability was already a mistake we should've avoided in a first place. Immutable objects is one of the best ways to ensure that you can pass around a node without copying and make the code safe and bug free. There are different patterns we can apply in order to address your use cases without needing mutability.
The current pattern is quite unsafe, e.g. a plugin can attach a completely different function, as there are no validations applied. Moreover, it is the only mutable method there, e.g. if you apply new tags, you get a new copy of the node and you don't modify the current node. It's completely out of place from the current functioning and idea of the nodes and what makes a node node, and not just a function.
Description
Creating large pipelines in Kedro is very slow and can take tens of seconds as reported in https://github.com/kedro-org/kedro/issues/3167
After some investigation, it turned out that number of factors contributed to that:
self.nodes
within thePipeline
class triggers many unneeded copies (original PRs: https://github.com/kedro-org/kedro/pull/3146)Pipeline
objectPipeline
object creation, caused by each node being copied and revalidatedThis PR addresses the first two completely, while it addresses the third one partially only when no new tags are added. The first one was addressed by https://github.com/kedro-org/kedro/pull/3146, but it's still not merged yet.
Development notes
For testing, @marrrcin 's test from https://github.com/kedro-org/kedro/issues/3167 was used and https://pyinstrument.readthedocs.io/en/latest/ profiling was run before and after.
After the changes from this PR and https://github.com/kedro-org/kedro/pull/3728, we've reduced the time it takes to sum 51 pipelines from ~15s down to ~6s, which is about 60% reduction in time. All of that was tested on Python 3.8 with the
graphlib
backport, it's possible that the built-ingraphlib
is much faster than the backport and might yield better results.Further improvements could be done by removing unnecessary
set()
andlist()
operations, doing a lightweight check for cycles without the need of instantiatinggraphlib.TopologicalSorter
upon init and potentially makingNode
andPipeline
use attrs. The latter will help ensuring that they remain immutable, as apparently a previous contribution snuck-in mutability to theNode
class which is against the idea of stateless nodes: https://github.com/kedro-org/kedro/blob/0fc8089b637a0679f71e2bddc91f0676fc2914a2/kedro/pipeline/node.py#L231-L239Before:
After:
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a
Signed-off-by
line in the commit message. See our wiki for guidance.If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md
file