FEAT-#5394: Reduce amount of remote calls for TreeReduce and GroupByReduce operators

modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code

http://modin.readthedocs.io

Apache License 2.0

9.8k stars 651 forks

FEAT-#5394: Reduce amount of remote calls for TreeReduce and GroupByReduce operators #7245

Closed: Retribution98 closed this 4 months ago

Retribution98 commented 4 months ago

Apply approaches from PR-7136 for TreeReduce and GroupByReduce operators
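As a rough illustration of the idea only (not Modin's actual implementation), the sketch below assumes the approach amounts to grouping several partitions into one remote call instead of submitting one call per partition. Everything in it is hypothetical: `CallCounter` stands in for a remote executor such as Ray or Dask, and `map_per_partition`, `map_per_row`, and `count_non_null` are made-up names, not Modin functions.

```python
from typing import Any, Callable, List, Optional

Partition = List[Optional[int]]  # stand-in for one block of a DataFrame


class CallCounter:
    """Hypothetical executor that runs work locally and counts 'remote' calls."""

    def __init__(self) -> None:
        self.calls = 0

    def submit(self, func: Callable[..., Any], *args: Any) -> Any:
        self.calls += 1
        return func(*args)


def map_per_partition(grid, func, executor):
    # Baseline: one simulated remote call for every partition in the grid.
    return [[executor.submit(func, part) for part in row] for row in grid]


def map_per_row(grid, func, executor):
    # Batched: one simulated remote call per row of partitions; the function
    # is applied to every partition of that row inside the single call.
    def apply_to_row(row):
        return [func(part) for part in row]

    return [executor.submit(apply_to_row, row) for row in grid]


def count_non_null(part: Partition) -> int:
    # Map step of a count-like reduction over one partition.
    return sum(value is not None for value in part)


# A 3 x 4 grid of small partitions.
grid = [[[1, None, 3] for _ in range(4)] for _ in range(3)]

naive = CallCounter()
batched = CallCounter()
naive_result = map_per_partition(grid, count_non_null, naive)
batched_result = map_per_row(grid, count_non_null, batched)

print(naive.calls, batched.calls)
```

Both variants produce the same per-partition results; only the number of submitted calls differs, which is where the saving would come from when per-call overhead dominates.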

What do these changes do?

[x] first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
[x] passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
[x] passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
[x] signed commit with git commit -s
[x] Resolves #5394
[x] tests added and passing
[x] module layout described at docs/development/architecture.rst is up-to-date

Retribution98 commented 4 months ago

> @Retribution98 do you have any performance numbers?

@anmyachev This case is similar to the previous PR, so we can expect the same performance. Using 112 CPUs:

| df.count | partitions shape | main | this PR |
| -- | -- | -- | -- |
| (Using 112 CPU) | (112, 1) | 0.202289 | 0.19788 |
| | (12544, 1) | 13.67759 | 10.99517 |
| | (112, 112) | 4.544378 | 1.760422 |

Retribution98 commented 4 months ago

> It's also a good idea to add tests for the new operators, which now work a little differently.

Since the logic lives at a lower level, I modified the existing test to exercise it, and it now covers all cases where map_partitions is used.

YarShev commented 4 months ago

@Retribution98, could you also check performance for dtypes, which is part of https://github.com/modin-project/modin/issues/2751?