modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.8k stars 651 forks

FEAT-#5394: Reduce amount of remote calls for TreeReduce and GroupByReduce operators #7245

Closed Retribution98 closed 4 months ago

Retribution98 commented 4 months ago

Apply the approaches from PR-7136 to the TreeReduce and GroupByReduce operators.
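As a rough illustration of what "reducing the amount of remote calls" can mean for a map-then-reduce operation such as a count, here is a minimal sketch in plain Python. It is not Modin's implementation: `CallCounter`, `split_into_batches`, `count_naive`, and `count_batched` are hypothetical names, and the idea that the approach amounts to grouping several partitions into one task is an assumption, not something stated in this thread.

```python
from typing import List, Sequence


class CallCounter:
    """Hypothetical stand-in for a task scheduler: runs inline, counts submissions."""

    def __init__(self) -> None:
        self.calls = 0

    def submit(self, fn, *args):
        self.calls += 1
        return fn(*args)


def count_non_null(block: Sequence) -> int:
    """Map step for one partition: count values that are not None."""
    return sum(value is not None for value in block)


def count_batch(batch: Sequence[Sequence]) -> int:
    """Map step for a whole batch of partitions, executed as a single task."""
    return sum(count_non_null(block) for block in batch)


def split_into_batches(items: Sequence, n_batches: int) -> List[list]:
    """Split items into at most n_batches contiguous, near-equal batches."""
    n_batches = max(1, min(n_batches, len(items)))
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        end = start + size + (1 if i < extra else 0)
        batches.append(list(items[start:end]))
        start = end
    return batches


def count_naive(partitions, scheduler) -> int:
    # One task per partition, then one task to combine the partial counts.
    partial = [scheduler.submit(count_non_null, p) for p in partitions]
    return scheduler.submit(sum, partial)


def count_batched(partitions, scheduler, n_workers: int) -> int:
    # One task per batch (about one per worker), then one combining task.
    batches = split_into_batches(partitions, n_workers)
    partial = [scheduler.submit(count_batch, b) for b in batches]
    return scheduler.submit(sum, partial)


# Assumed toy data: 1000 tiny partitions, 8 workers.
partitions = [[1, None, 3] for _ in range(1000)]
naive, batched = CallCounter(), CallCounter()
naive_result = count_naive(partitions, naive)
batched_result = count_batched(partitions, batched, n_workers=8)
print(naive_result, batched_result)  # 2000 2000
print(naive.calls, batched.calls)    # 1001 9
```

Both variants return the same count; only the number of submitted tasks differs, which is where per-task scheduling overhead would be saved when partitions greatly outnumber workers.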

What do these changes do?

Retribution98 commented 4 months ago

> @Retribution98 do you have any performance numbers?

@anmyachev This case is similar to the previous PR, so we can expect the same performance. Using 112 CPUs:

| df.count | partitions shape | main | this PR |
| -- | -- | -- | -- |
| (Using 112 CPU) | (112, 1) | 0.202289 | 0.19788 |
| | (12544, 1) | 13.67759 | 10.99517 |
| | (112, 112) | 4.544378 | 1.760422 |
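To make the comparison easier to read, the relative speedup per partition shape can be computed from the `df.count` figures reported above. This is only arithmetic on those numbers (the comment does not state units), and the variable names are illustrative.

```python
# df.count timings reported above, as (main, this PR) per partition shape.
timings = {
    (112, 1): (0.202289, 0.19788),
    (12544, 1): (13.67759, 10.99517),
    (112, 112): (4.544378, 1.760422),
}

# Speedup = time on main / time with this PR.
speedups = {shape: main / pr for shape, (main, pr) in timings.items()}

for shape, speedup in speedups.items():
    print(f"{shape}: {speedup:.2f}x")
# (112, 1): 1.02x
# (12544, 1): 1.24x
# (112, 112): 2.58x
```

On these figures the gain is negligible when there is one partition per CPU and largest for the (112, 112) grid.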

Retribution98 commented 4 months ago

> It's also a good idea to add tests for the new operators, which now work a little differently.

Since the logic lives at a lower level, I modified the existing test to exercise it there, which covers all cases where map_partitions is used.

YarShev commented 4 months ago

@Retribution98, could you also check performance for dtypes, which is part of https://github.com/modin-project/modin/issues/2751?