Open merelcht opened 2 years ago
I think it's better to further split this into two issues, but I will leave my comments for both topics.
We have something like PipelineMonitoringHook in our docs. It requires some infrastructure and is not easy for regular users to set up.
kedro-viz side to show where the bottleneck is and help users optimize their pipeline.
ParallelRunner or something similar. Currently, the workload is distributed naively, but not every node is equal.
Similar to the Pipeline's run_only_missing, but a more sophisticated one. During development, it's common to work on one particular node and only need to refresh that node (or a few dependent nodes). We can do some back-tracking. Currently, users have to figure out which nodes are not necessary and use kedro run --from-nodes to skip unnecessary computation.
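To make the back-tracking idea concrete, here is a minimal sketch in plain Python. It does not use any Kedro API: the PIPELINE mapping, the dataset names and the nodes_to_run helper are all hypothetical, and the pipeline is assumed to be acyclic. Starting from the node being worked on, it walks back through the inputs and stops wherever an input is already persisted.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy pipeline: node name -> (input datasets, output datasets).
PIPELINE: Dict[str, Tuple[List[str], List[str]]] = {
    "load": (["raw"], ["clean"]),
    "features": (["clean"], ["features"]),
    "train": (["features"], ["model"]),
    "report": (["model", "clean"], ["report"]),
}


def nodes_to_run(target: str, persisted: Set[str]) -> List[str]:
    """Back-track from `target` and return the nodes that must run, in order."""
    # Map every dataset to the node that produces it.
    producer = {out: name for name, (_, outs) in PIPELINE.items() for out in outs}
    needed: List[str] = []

    def visit(name: str) -> None:
        if name in needed:
            return
        for dataset in PIPELINE[name][0]:
            # A persisted input can simply be loaded, so back-tracking stops there.
            if dataset not in persisted and dataset in producer:
                visit(producer[dataset])
        needed.append(name)  # appended only after its upstream nodes

    visit(target)
    return needed


if __name__ == "__main__":
    # "clean" is already on disk, so "load" can be skipped.
    print(nodes_to_run("train", persisted={"raw", "clean"}))
```

In this toy setup the user would no longer have to work out the --from-nodes argument by hand; the set of persisted datasets determines it.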
One key realization of this change is that Run needs to have memory. To optimize runtime performance, it needs to know how it was run previously. To re-run the pipeline in a smart way, it needs to know about the previous run(s) and work out the minimal computation required.
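One way to picture a run with memory is a small record written at the end of each run and compared at the start of the next. The sketch below is an assumption, not an existing Kedro feature: the JSON file, the fingerprint function and the choice of what goes into a fingerprint are all hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def fingerprint(payload: str) -> str:
    """Hash whatever identifies a node's code and inputs (hypothetical choice)."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def stale_nodes(current: Dict[str, str], record_path: Path) -> List[str]:
    """Return nodes whose fingerprint differs from the previous run's record."""
    previous = json.loads(record_path.read_text()) if record_path.exists() else {}
    return [name for name, digest in current.items() if previous.get(name) != digest]


def save_record(current: Dict[str, str], record_path: Path) -> None:
    """Persist fingerprints so that the next run can compare against them."""
    record_path.write_text(json.dumps(current, indent=2))
```

With no record on disk every node counts as stale, which matches a first run. This sketch only detects nodes that changed directly; anything downstream of a stale node would need to be re-run as well. The same record could also hold per-node durations to inform how work is distributed.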
Related Issue:
This is a very frequent question actually, will try to collect more evidence for it going forward.
There are different things to consider when it comes to performance, namely (1) execution time and (2) RAM usage. Each has its own tools, so most likely we would need a dedicated effort for each.
I think execution time is probably the most urgent one. This is how I used pyinstrument: https://github.com/kedro-org/kedro/issues/3033#issue-1895014637
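For a rough first look at both dimensions without extra dependencies, the standard library can time a function and track its peak Python memory. This is a sketch using time.perf_counter and tracemalloc rather than pyinstrument; the measure helper and the build_squares stand-in node are hypothetical. Note that tracemalloc only sees allocations made through Python's allocator, so it is not the same as process RAM.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict, Tuple


def measure(func: Callable[..., Any], *args: Any) -> Tuple[Any, Dict[str, float]]:
    """Run `func` and report wall-clock seconds and peak traced memory in bytes."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        result = func(*args)
    finally:
        # Collect the measurements and stop tracing even if `func` raises.
        elapsed = time.perf_counter() - started
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, {"seconds": elapsed, "peak_bytes": float(peak)}


def build_squares(n: int) -> list:
    # Hypothetical stand-in for a pipeline node.
    return [i * i for i in range(n)]


if __name__ == "__main__":
    _, stats = measure(build_squares, 100_000)
    print(stats)
```

Wrapping each node function this way gives a per-node table of time and memory, which is enough to spot the obvious bottleneck before reaching for a full profiler.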
Description
Kedro currently doesn't offer any options to analyse the performance of pipelines. Additionally, our users have flagged that they would like to be able to re-run only parts of their pipeline.
Implementation ideas
Questions