mars-project / mars

Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.
https://mars-project.readthedocs.io
Apache License 2.0
2.68k stars 325 forks source link

[GRAPH][ENHANCEMENT] Optimize `SubtakGraph` generation #3341

Open zhongchun opened 1 year ago

zhongchun commented 1 year ago

Problem

In gen_subtask_graph, Mars always create new out chunks even if the out chunk already exists. It costs a lot of time if there are plenty of chunks. Actually, there is no need to create out chunks except the FetchChunk. I did a comparison, in which one creates new out chunks and the other does not. The result shows that the cost is reduced from 122.92s to 56.63s with about 53.93% reduction.

So we should optimize the SubtaskGraph generation and make it more efficient.