mars-project / mars

Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.
https://mars-project.readthedocs.io
Apache License 2.0
2.68k stars 325 forks source link

Optimize SubtaskGraph generation #3342

Open zhongchun opened 1 year ago

zhongchun commented 1 year ago

What do these changes do?

In gen_subtask_graph, Mars always create new out chunks even if the out chunk already exists. It costs a lot of time if there are plenty of chunks.

Related issue number

Fixes #3341

I did a comparison, in which one creates new out chunks and the other does not. The test scripts are:

import mars.tensor as mt
import mars.dataframe as md

size = 50000
da1 = mt.random.random((size, 2), chunk_size=(1, 2))
df1 = md.DataFrame(da1, columns=list("AB"))
df2 = df1 + 10
df3 = df2.sum()
ret = df3.execute()

Cost time of Subtask generation are: 122.92s, 56.63s.

Check code requirements