What do these changes do?
This PR reduces the load on the storage service (the object store in the Ray backend). For example, for TPC-H 1 GB q05, the total size of data put into storage drops from 3282932389 to 2403252033, a reduction of about 27%. Rechunk and auto merging benefit the most.
Avoid generating concat ops that have only one input. (These are mainly generated by auto merging and rechunk.)
Before: many DataFrameConcat ops have only one input.
After: the DataFrameConcat ops with only one input have been removed.
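The single-input case can be illustrated with a minimal sketch. This is not the code from this PR: `Chunk` and `merge_chunks` are hypothetical names invented for illustration, and the chunk graph is reduced to a small dataclass.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk node in a chunk graph."""
    name: str
    op: Optional[str] = None  # name of the op that produces this chunk
    inputs: List["Chunk"] = field(default_factory=list)


def merge_chunks(chunks: List[Chunk]) -> Chunk:
    """Merge chunks into one, emitting a concat op only when it is needed."""
    if not chunks:
        raise ValueError("need at least one chunk to merge")
    if len(chunks) == 1:
        # A concat over a single input would only copy the data and write
        # it to storage a second time, so reuse the input chunk directly.
        return chunks[0]
    return Chunk(
        name="concat(" + ",".join(c.name for c in chunks) + ")",
        op="DataFrameConcat",
        inputs=list(chunks),
    )
```

With this guard, a merge step that happens to receive one chunk returns that chunk unchanged, and no extra DataFrameConcat node enters the graph.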
Fuse the concat op with its successor subtask.
Before: DataFrameConcat is a standalone subtask.
After: DataFrameConcat is fused with its successor.
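The fusion step can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the scheduler code from this PR: subtasks are modeled as plain lists of op names, `fuse_concat_into_successor` is a hypothetical helper, and only concat subtasks with exactly one successor are fused.

```python
from typing import Dict, List

CONCAT_OP = "DataFrameConcat"  # op name as used in the description above


def fuse_concat_into_successor(
    subtasks: Dict[str, List[str]],    # subtask id -> ordered op names
    successors: Dict[str, List[str]],  # subtask id -> successor subtask ids
) -> Dict[str, List[str]]:
    """Return a new mapping in which every standalone concat subtask with
    exactly one successor has been prepended to that successor."""
    fused = {sid: list(ops) for sid, ops in subtasks.items()}
    merged_into: Dict[str, str] = {}  # fused subtask id -> subtask that absorbed it
    for sid, ops in subtasks.items():
        succ = successors.get(sid, [])
        if ops == [CONCAT_OP] and len(succ) == 1:
            target = succ[0]
            # The successor may itself have been fused already; follow the chain.
            while target in merged_into:
                target = merged_into[target]
            # Run the concat inside the consumer's subtask.
            fused[target] = fused.pop(sid) + fused[target]
            merged_into[sid] = target
    return fused
```

In this model the concat result is consumed inside the same subtask, so it presumably no longer has to be put into the object store as a separate intermediate result.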
Related issue number
Fixes #xxxx
Check code requirements
[ ] tests added / passed (if needed)
[ ] Ensure all linting tests pass, see here for how to run them