kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Optimise pipeline addition and creation #3730

Closed idanov closed 6 months ago

idanov commented 6 months ago

Description

Creating large pipelines in Kedro is very slow and can take tens of seconds as reported in https://github.com/kedro-org/kedro/issues/3167

After some investigation, it turned out that a number of factors contributed to that:

This PR addresses the first two completely and the third one partially, only when no new tags are added. The first one was addressed by https://github.com/kedro-org/kedro/pull/3146, but that PR is not merged yet.

Development notes

For testing, @marrrcin's test from https://github.com/kedro-org/kedro/issues/3167 was used, and https://pyinstrument.readthedocs.io/en/latest/ profiling was run before and after.

After the changes from this PR and https://github.com/kedro-org/kedro/pull/3728, the time it takes to sum 51 pipelines dropped from ~15s to ~6s, which is about a 60% reduction. All of this was tested on Python 3.8 with the graphlib backport; the built-in graphlib may well be faster than the backport and yield better results.
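The size of the reduction can be checked against the "Sum of N pipelines" timings printed in the Before and After runs of this PR. A minimal sketch of that arithmetic (the variable and function names are illustrative only):

```python
# Timings in seconds, copied from the "Before" and "After" runs in this PR.
before = {1: 0.000, 11: 0.685, 21: 2.514, 31: 5.570, 41: 10.029, 51: 15.612}
after = {1: 0.000, 11: 0.276, 21: 1.099, 31: 2.391, 41: 4.181, 51: 6.448}


def reduction(old: float, new: float) -> float:
    """Return the time saved as a fraction of the old time (0.0 if old is 0)."""
    return (old - new) / old if old else 0.0


reductions = {n: reduction(before[n], after[n]) for n in before}

for n, r in reductions.items():
    print(f"Sum of {n} pipelines: {before[n]:.3f}s -> {after[n]:.3f}s ({r:.1%} less)")
```

For the 51-pipeline case this comes to a little under 59%, consistent with the "about 60%" figure.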

Further improvements could be made by removing unnecessary set() and list() operations, doing a lightweight check for cycles without instantiating graphlib.TopologicalSorter upon init, and potentially making Node and Pipeline use attrs. The last of these would help ensure that they remain immutable, as apparently a previous contribution snuck mutability into the Node class, which goes against the idea of stateless nodes: https://github.com/kedro-org/kedro/blob/0fc8089b637a0679f71e2bddc91f0676fc2914a2/kedro/pipeline/node.py#L231-L239
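To illustrate the cycle-check idea, here is a rough sketch of a Kahn-style check next to the graphlib.TopologicalSorter route. It is not the code in this PR: `has_cycle`, `graphlib_has_cycle` and the toy graphs are hypothetical, string names stand in for Node objects, and whether such a check is actually cheaper would need profiling.

```python
from __future__ import annotations

from collections import deque
from graphlib import CycleError, TopologicalSorter  # stdlib on Python 3.9+


def has_cycle(dependencies: dict[str, set[str]]) -> bool:
    """Kahn-style check: True if the dependency graph contains a cycle.

    `dependencies` maps each node to the set of nodes it depends on.
    """
    # Collect every node, including ones that only appear as a dependency.
    nodes = set(dependencies)
    for parents in dependencies.values():
        nodes |= parents

    # Number of unmet dependencies per node, and the reverse edges.
    remaining = {n: len(dependencies.get(n, ())) for n in nodes}
    children: dict[str, list[str]] = {n: [] for n in nodes}
    for node, parents in dependencies.items():
        for parent in parents:
            children[parent].append(node)

    ready = deque(n for n, count in remaining.items() if count == 0)
    visited = 0
    while ready:
        current = ready.popleft()
        visited += 1
        for child in children[current]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    # Any node that was never released sits on, or behind, a cycle.
    return visited != len(nodes)


def graphlib_has_cycle(dependencies: dict[str, set[str]]) -> bool:
    """Same question answered by instantiating TopologicalSorter."""
    try:
        TopologicalSorter(dependencies).prepare()
    except CycleError:
        return True
    return False


# Toy graphs; the node names are made up for this example.
acyclic = {"preprocess": set(), "split": {"preprocess"}, "train": {"split"}}
cyclic = {"a": {"b"}, "b": {"a"}}

print(has_cycle(acyclic), graphlib_has_cycle(acyclic))
print(has_cycle(cyclic), graphlib_has_cycle(cyclic))
```

Both functions give the same answer on these inputs; the first one only counts released nodes and never builds a sorted order.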

Before:

╰─❯ pyinstrument --show '*/kedro/pipeline/*' -m kedro registry list
Sum of 1 pipelines took: 0.000s
Sum of 11 pipelines took: 0.685s
Sum of 21 pipelines took: 2.514s
Sum of 31 pipelines took: 5.570s
Sum of 41 pipelines took: 10.029s
Sum of 51 pipelines took: 15.612s
- __default__
- data_processing
- data_science

  _     ._   __/__   _ _  _  _ _/_   Recorded: 18:16:21  Samples:  58702
 /_//_/// /_\ / //_// / //_'/ //     Duration: 72.331    CPU time: 67.571
/   _/                      v4.6.2

Program: pyinstrument --show */kedro/pipeline/* -m kedro registry list

72.323 <module>  kedro/__main__.py:1
├─ 69.608 main  kedro/framework/cli/cli.py:225
│     [38 frames hidden]  kedro, click, importlib_metadata, imp...
│        63.056 _ProjectPipelines._load_data  kedro/framework/project/__init__.py:176
│        └─ 63.054 register_pipelines  kedro_spaceflights/pipeline_registry.py:8
│           └─ 62.366 find_pipelines  kedro/framework/project/__init__.py:322
│                 [3 frames hidden]  kedro, importlib
│                    58.453 _create_pipeline  kedro/framework/project/__init__.py:299
│                    └─ 58.447 create_pipeline  kedro_spaceflights/pipelines/data_processing/pipeline.py:7
│                       ├─ 56.910 Pipeline.__add__  kedro/pipeline/pipeline.py:181
│                       │  ├─ 55.767 Pipeline.__init__  kedro/pipeline/pipeline.py:80
│                       │  │  ├─ 28.988 _topologically_sorted  kedro/pipeline/pipeline.py:887
│                       │  │  │  └─ 28.988 <listcomp>  kedro/pipeline/pipeline.py:912
│                       │  │  │     ├─ 22.526 Node.__lt__  kedro/pipeline/node.py:184
│                       │  │  │     │  ├─ 19.511 Node._unique_key  kedro/pipeline/node.py:165
│                       │  │  │     │  │  ├─ 7.283 hashable  kedro/pipeline/node.py:167
│                       │  │  │     │  │  │  ├─ 4.472 [self]  kedro/pipeline/node.py
│                       │  │  │     │  │  │  └─ 2.811 isinstance  <built-in>
│                       │  │  │     │  │  ├─ 6.322 Node.name  kedro/pipeline/node.py:264
│                       │  │  │     │  │  │  ├─ 4.475 [self]  kedro/pipeline/node.py
│                       │  │  │     │  │  │  └─ 1.847 Node.namespace  kedro/pipeline/node.py:289
│                       │  │  │     │  │  └─ 5.906 [self]  kedro/pipeline/node.py
│                       │  │  │     │  └─ 2.698 [self]  kedro/pipeline/node.py
│                       │  │  │     └─ 5.780 toposort  toposort.py:47
│                       │  │  │           [4 frames hidden]  toposort
│                       │  │  │              2.199 Node.__hash__  kedro/pipeline/node.py:189
│                       │  │  │              └─ 1.839 Node._unique_key  kedro/pipeline/node.py:165
│                       │  │  │              2.094 <dictcomp>  toposort.py:61
│                       │  │  │              ├─ 1.109 Node.__hash__  kedro/pipeline/node.py:189
│                       │  │  │              │  └─ 0.954 Node._unique_key  kedro/pipeline/node.py:165
│                       │  │  │              1.186 <dictcomp>  toposort.py:79
│                       │  │  │              └─ 1.093 Node.__hash__  kedro/pipeline/node.py:189
│                       │  │  │                 └─ 0.900 Node._unique_key  kedro/pipeline/node.py:165
│                       │  │  ├─ 15.735 <listcomp>  kedro/pipeline/pipeline.py:148
│                       │  │  │  └─ 15.581 Node.tag  kedro/pipeline/node.py:251
│                       │  │  │     └─ 14.616 Node._copy  kedro/pipeline/node.py:145
│                       │  │  │        └─ 14.009 Node.__init__  kedro/pipeline/node.py:22
│                       │  │  │           ├─ 9.112 Node._validate_inputs  kedro/pipeline/node.py:501
│                       │  │  │           │  ├─ 4.069 signature  inspect.py:3103
│                       │  │  │           │  │     [7 frames hidden]  inspect
│                       │  │  │           │  └─ 3.861 Signature.bind  inspect.py:3032
│                       │  │  │           │        [3 frames hidden]  inspect
│                       │  │  │           ├─ 1.532 Node._validate_unique_outputs  kedro/pipeline/node.py:521
│                       │  │  │           │  └─ 0.812 Counter.__init__  collections/__init__.py:540
│                       │  │  │           ├─ 1.294 [self]  kedro/pipeline/node.py
│                       │  │  │           └─ 0.901 Node._validate_inputs_dif_than_outputs  kedro/pipeline/node.py:530
│                       │  │  ├─ 3.754 Pipeline.node_dependencies  kedro/pipeline/pipeline.py:325
│                       │  │  │  ├─ 2.148 <dictcomp>  kedro/pipeline/pipeline.py:334
│                       │  │  │  │  └─ 2.015 Node.__hash__  kedro/pipeline/node.py:189
│                       │  │  │  │     └─ 1.836 Node._unique_key  kedro/pipeline/node.py:165
│                       │  │  │  │        └─ 1.217 [self]  kedro/pipeline/node.py
│                       │  │  │  └─ 0.936 [self]  kedro/pipeline/pipeline.py
│                       │  │  ├─ 1.259 _validate_transcoded_inputs_outputs  kedro/pipeline/pipeline.py:861
│                       │  │  ├─ 1.076 Node.__hash__  kedro/pipeline/node.py:189
│                       │  │  │  └─ 0.911 Node._unique_key  kedro/pipeline/node.py:165
│                       │  │  ├─ 0.883 _strip_transcoding  kedro/pipeline/pipeline.py:46
│                       │  │  ├─ 0.879 _validate_unique_outputs  kedro/pipeline/pipeline.py:839
│                       │  │  │  └─ 0.854 Counter.__init__  collections/__init__.py:540
│                       │  │  │        [2 frames hidden]  collections
│                       │  │  └─ 0.803 <listcomp>  kedro/pipeline/pipeline.py:142
│                       │  │     └─ 0.775 [self]  kedro/pipeline/pipeline.py
│                       │  └─ 1.025 Node.__hash__  kedro/pipeline/node.py:189
│                       │     └─ 0.852 Node._unique_key  kedro/pipeline/node.py:165
│                       └─ 1.285 pipeline  kedro/pipeline/modular_pipeline.py:167
│                          └─ 0.991 Pipeline.__init__  kedro/pipeline/pipeline.py:80
│                    3.895 import_module  importlib/__init__.py:109
│                    └─ 3.864 <module>  kedro_spaceflights/pipelines/data_science/__init__.py:1
│                       └─ 3.859 <module>  kedro_spaceflights/pipelines/data_science/pipeline.py:1
│                          └─ 3.857 <module>  kedro_spaceflights/pipelines/data_science/nodes.py:1
│                             └─ 3.229 <module>  sklearn/__init__.py:1
│                                   [13 frames hidden]  sklearn, scipy, importlib
└─ 2.709 <module>  kedro/framework/cli/__init__.py:1
      [4 frames hidden]  kedro

After:

╰─❯ pyinstrument --show '*/kedro/pipeline/*' -m kedro registry list
Sum of 1 pipelines took: 0.000s
Sum of 11 pipelines took: 0.276s
Sum of 21 pipelines took: 1.099s
Sum of 31 pipelines took: 2.391s
Sum of 41 pipelines took: 4.181s
Sum of 51 pipelines took: 6.448s
- __default__
- data_processing
- data_science

  _     ._   __/__   _ _  _  _ _/_   Recorded: 18:02:59  Samples:  25956
 /_//_/// /_\ / //_// / //_'/ //     Duration: 36.311    CPU time: 33.158
/   _/                      v4.6.2

Program: pyinstrument --show */kedro/pipeline/* -m kedro registry list

36.305 <module>  kedro/__main__.py:1
├─ 34.037 main  kedro/framework/cli/cli.py:225
│     [53 frames hidden]  kedro, click, importlib_metadata, imp...
│        27.958 _ProjectPipelines._load_data  kedro/framework/project/__init__.py:176
│        └─ 27.953 register_pipelines  kedro_spaceflights/pipeline_registry.py:8
│           └─ 27.607 find_pipelines  kedro/framework/project/__init__.py:322
│                 [3 frames hidden]  kedro, importlib
│                    24.602 _create_pipeline  kedro/framework/project/__init__.py:299
│                    └─ 24.598 create_pipeline  kedro_spaceflights/pipelines/data_processing/pipeline.py:7
│                       ├─ 23.641 Pipeline.__add__  kedro/pipeline/pipeline.py:192
│                       │  ├─ 22.479 Pipeline.__init__  kedro/pipeline/pipeline.py:78
│                       │  │  ├─ 7.657 TopologicalSorter.prepare  graphlib/graphlib.py:84
│                       │  │  │     [3 frames hidden]  graphlib
│                       │  │  │        7.588 TopologicalSorter._find_cycle  graphlib/graphlib.py:196
│                       │  │  │        ├─ 6.419 Node.__hash__  kedro/pipeline/node.py:189
│                       │  │  │        │  ├─ 5.352 Node._unique_key  kedro/pipeline/node.py:165
│                       │  │  │        │  │  ├─ 1.937 hashable  kedro/pipeline/node.py:167
│                       │  │  │        │  │  │  ├─ 1.187 [self]  kedro/pipeline/node.py
│                       │  │  │        │  │  │  └─ 0.750 isinstance  <built-in>
│                       │  │  │        │  │  ├─ 1.810 Node.name  kedro/pipeline/node.py:264
│                       │  │  │        │  │  │  ├─ 1.262 [self]  kedro/pipeline/node.py
│                       │  │  │        │  │  │  └─ 0.548 Node.namespace  kedro/pipeline/node.py:289
│                       │  │  │        │  │  └─ 1.605 [self]  kedro/pipeline/node.py
│                       │  │  │        │  ├─ 0.685 [self]  kedro/pipeline/node.py
│                       │  │  │        │  └─ 0.382 hash  <built-in>
│                       │  │  ├─ 4.555 TopologicalSorter.__init__  graphlib/graphlib.py:41
│                       │  │  │     [4 frames hidden]  graphlib
│                       │  │  │        3.326 TopologicalSorter._get_nodeinfo  graphlib/graphlib.py:51
│                       │  │  │        └─ 2.899 Node.__hash__  kedro/pipeline/node.py:189
│                       │  │  │           └─ 2.515 Node._unique_key  kedro/pipeline/node.py:165
│                       │  │  │              ├─ 1.310 [self]  kedro/pipeline/node.py
│                       │  │  │              ├─ 0.609 Node.name  kedro/pipeline/node.py:264
│                       │  │  │              │  └─ 0.442 [self]  kedro/pipeline/node.py
│                       │  │  │              └─ 0.596 hashable  kedro/pipeline/node.py:167
│                       │  │  ├─ 3.309 Pipeline.node_dependencies  kedro/pipeline/pipeline.py:336
│                       │  │  │  ├─ 1.755 <dictcomp>  kedro/pipeline/pipeline.py:345
│                       │  │  │  │  └─ 1.633 Node.__hash__  kedro/pipeline/node.py:189
│                       │  │  │  │     └─ 1.453 Node._unique_key  kedro/pipeline/node.py:165
│                       │  │  │  │        └─ 0.827 [self]  kedro/pipeline/node.py
│                       │  │  │  ├─ 0.895 [self]  kedro/pipeline/pipeline.py
│                       │  │  │  └─ 0.411 _strip_transcoding  kedro/pipeline/pipeline.py:44
│                       │  │  ├─ 1.282 _validate_transcoded_inputs_outputs  kedro/pipeline/pipeline.py:882
│                       │  │  │  └─ 0.427 _strip_transcoding  kedro/pipeline/pipeline.py:44
│                       │  │  ├─ 1.090 Node.__hash__  kedro/pipeline/node.py:189
│                       │  │  │  └─ 0.885 Node._unique_key  kedro/pipeline/node.py:165
│                       │  │  ├─ 0.900 _strip_transcoding  kedro/pipeline/pipeline.py:44
│                       │  │  │  └─ 0.650 _transcode_split  kedro/pipeline/pipeline.py:21
│                       │  │  │     └─ 0.405 [self]  kedro/pipeline/pipeline.py
│                       │  │  ├─ 0.872 _validate_unique_outputs  kedro/pipeline/pipeline.py:860
│                       │  │  │  └─ 0.844 Counter.__init__  collections/__init__.py:540
│                       │  │  │        [2 frames hidden]  collections
│                       │  │  │           0.844 Counter.update  collections/__init__.py:608
│                       │  │  │           └─ 0.404 _strip_transcoding  kedro/pipeline/pipeline.py:44
│                       │  │  ├─ 0.628 _validate_duplicate_nodes  kedro/pipeline/pipeline.py:825
│                       │  │  │  └─ 0.509 _check_node  kedro/pipeline/pipeline.py:829
│                       │  │  ├─ 0.534 [self]  kedro/pipeline/pipeline.py
│                       │  │  ├─ 0.429 <dictcomp>  kedro/pipeline/pipeline.py:151
│                       │  │  └─ 0.423 <listcomp>  kedro/pipeline/pipeline.py:140
│                       │  │     └─ 0.393 [self]  kedro/pipeline/pipeline.py
│                       │  └─ 1.083 Node.__hash__  kedro/pipeline/node.py:189
│                       │     └─ 0.910 Node._unique_key  kedro/pipeline/node.py:165
│                       └─ 0.857 pipeline  kedro/pipeline/modular_pipeline.py:167
│                          └─ 0.553 Pipeline.__init__  kedro/pipeline/pipeline.py:78
│                    2.978 import_module  importlib/__init__.py:109
│                    └─ 2.969 <module>  kedro_spaceflights/pipelines/data_science/__init__.py:1
│                       └─ 2.968 <module>  kedro_spaceflights/pipelines/data_science/pipeline.py:1
│                          └─ 2.966 <module>  kedro_spaceflights/pipelines/data_science/nodes.py:1
│                             ├─ 2.391 <module>  sklearn/__init__.py:1
│                             │     [16 frames hidden]  sklearn, scipy
│                             └─ 0.571 <module>  sklearn/linear_model/__init__.py:1
└─ 2.262 <module>  kedro/framework/cli/__init__.py:1
      [11 frames hidden]  kedro, dynaconf


marrrcin commented 6 months ago

Awesome job, huge improvement 👏🏻

As for:

potentially making Node and Pipeline use attrs. The last of these would help ensure that they remain immutable, as apparently a previous contribution snuck mutability into the Node class, which goes against the idea of stateless nodes

That change would be really unfortunate, because the flow of having a hook that changes the node.func at runtime is a common pattern I've seen (and also used / recommended) multiple times.

Examples:

idanov commented 6 months ago

This is a side conversation, not related to the PR, but responding to:

That change would be really unfortunate, because the flow of having a hook that changes the node.func at runtime is a common pattern I've seen (and also used / recommended) multiple times.

The immutability change would unfortunately be a breaking change, so it is unlikely to happen soon. Nevertheless, we can make them attrs objects even without making them fully immutable and with no breaking changes.

The introduction of mutability was a mistake we should have avoided in the first place. Immutable objects are one of the best ways to ensure that you can pass a node around without copying it and to keep the code safe and bug-free. There are different patterns we can apply to address your use cases without needing mutability.

The current pattern is quite unsafe: a plugin can attach a completely different function, as no validations are applied. Moreover, it is the only mutating operation there; for example, if you apply new tags, you get a new copy of the node and the current node is not modified. It's completely out of place with how nodes currently work and with the idea of what makes a node a node, and not just a function.
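As a rough illustration of the copy-instead-of-mutate behaviour described here, the sketch below uses a frozen dataclass from the standard library as a stand-in for attrs. `FrozenNode` and `with_func` are hypothetical and not part of Kedro; they only show one possible pattern for swapping a function without in-place mutation.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass, field, replace
from typing import Callable


@dataclass(frozen=True)
class FrozenNode:
    """Hypothetical stand-in for a node; not Kedro's actual Node class."""

    func: Callable
    name: str
    tags: frozenset[str] = field(default_factory=frozenset)

    def tag(self, *new_tags: str) -> FrozenNode:
        """Return a copy with the extra tags; the original is left untouched."""
        return replace(self, tags=self.tags | frozenset(new_tags))

    def with_func(self, func: Callable) -> FrozenNode:
        """Return a copy wrapping a different function (hypothetical helper)."""
        return replace(self, func=func)


def identity(x):
    return x


original = FrozenNode(func=identity, name="identity_node")
tagged = original.tag("preprocessing")
swapped = original.with_func(len)

try:
    original.func = len  # in-place mutation is rejected on a frozen class
    mutated = True
except FrozenInstanceError:
    mutated = False

print(original.tags, tagged.tags, swapped.func, mutated)
```

Because every change returns a new object, a node can be shared between pipelines without defensive copies, and a function swap becomes an explicit step where validation could be added.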