rapidsai / dask-cuda

Utilities for Dask and CUDA interactions
https://docs.rapids.ai/api/dask-cuda/stable/
Apache License 2.0

Performance regression in cuDF merge benchmark #935

Open pentschev opened 2 years ago

pentschev commented 2 years ago

Running the cuDF benchmark with RAPIDS 22.06 results in the following:

RAPIDS 22.06 cuDF benchmark

```
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-16 08:21:54,375 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-16 08:21:54,382 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
20.70 s        | 294.80 MiB/s
17.62 s        | 346.49 MiB/s
39.32 s        | 155.22 MiB/s
================================================================================
Throughput     | 265.50 MiB +/- 80.79 MiB
Wall-Clock     | 25.88 s +/- 9.59 s
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 110.55 MiB/s 153.32 MiB/s 187.99 MiB/s (12.85 GiB)
(02,01)        | 147.30 MiB/s 173.17 MiB/s 187.13 MiB/s (12.85 GiB)
```

Rolling back one year to RAPIDS 21.06, performance was substantially better:

RAPIDS 21.06 cuDF benchmark

```
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock     | Throughput
-------------------------------
15.40 s        | 396.40 MiB/s
7.35 s         | 830.55 MiB/s
8.80 s         | 693.83 MiB/s
===============================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 325.82 MiB/s 332.85 MiB/s 351.81 MiB/s (12.85 GiB)
(02,01)        | 296.46 MiB/s 321.66 MiB/s 333.66 MiB/s (12.85 GiB)
```

It isn't clear where this regression comes from; potential candidates include Distributed, cuDF, and Dask-CUDA itself.

pentschev commented 2 years ago

RAPIDS 21.12 and 22.02 perform better than 21.06; the regression first appears in 22.04. See results below.

RAPIDS 21.06 cuDF benchmark - 10 iterations

```
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock     | Throughput
-------------------------------
8.28 s         | 737.58 MiB/s
15.94 s        | 382.80 MiB/s
16.27 s        | 375.17 MiB/s
15.90 s        | 383.76 MiB/s
15.54 s        | 392.86 MiB/s
15.67 s        | 389.50 MiB/s
15.52 s        | 393.30 MiB/s
16.02 s        | 381.04 MiB/s
7.72 s         | 790.71 MiB/s
8.35 s         | 730.57 MiB/s
===============================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 291.99 MiB/s 317.62 MiB/s 402.70 MiB/s (51.04 GiB)
(02,01)        | 295.38 MiB/s 327.51 MiB/s 401.62 MiB/s (51.03 GiB)
```
RAPIDS 21.12 cuDF benchmark - 10 iterations

```
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock     | Throughput
-------------------------------
5.91 s         | 1.01 GiB/s
5.72 s         | 1.04 GiB/s
11.16 s        | 546.74 MiB/s
4.82 s         | 1.24 GiB/s
4.87 s         | 1.22 GiB/s
4.83 s         | 1.23 GiB/s
5.72 s         | 1.04 GiB/s
5.78 s         | 1.03 GiB/s
5.76 s         | 1.03 GiB/s
11.18 s        | 546.06 MiB/s
===============================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 429.34 MiB/s 509.66 MiB/s 626.34 MiB/s (39.86 GiB)
(02,01)        | 419.16 MiB/s 502.99 MiB/s 633.16 MiB/s (39.86 GiB)
```
RAPIDS 22.02 cuDF benchmark - 10 iterations

```
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
4.79 s         | 1.24 GiB/s
10.99 s        | 555.41 MiB/s
10.05 s        | 607.36 MiB/s
10.29 s        | 593.14 MiB/s
9.94 s         | 614.12 MiB/s
10.37 s        | 588.66 MiB/s
4.78 s         | 1.25 GiB/s
5.71 s         | 1.04 GiB/s
10.13 s        | 602.58 MiB/s
4.69 s         | 1.27 GiB/s
================================================================================
Throughput     | 848.05 MiB +/- 317.55 MiB
Wall-Clock     | 8.17 s +/- 2.62 s
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 428.54 MiB/s 478.27 MiB/s 562.20 MiB/s (48.80 GiB)
(02,01)        | 440.11 MiB/s 513.94 MiB/s 562.02 MiB/s (48.80 GiB)
```
RAPIDS 22.04 cuDF benchmark - 10 iterations

```
$ python local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
2022-06-20 02:01:12,323 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-20 02:01:12,325 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
49.11 s        | 124.29 MiB/s
48.77 s        | 125.15 MiB/s
45.11 s        | 135.31 MiB/s
44.94 s        | 135.81 MiB/s
44.13 s        | 138.30 MiB/s
40.67 s        | 150.07 MiB/s
48.73 s        | 125.25 MiB/s
15.46 s        | 394.67 MiB/s
48.26 s        | 126.47 MiB/s
44.82 s        | 136.17 MiB/s
================================================================================
Throughput     | 159.15 MiB +/- 78.88 MiB
Wall-Clock     | 43.00 s +/- 9.53 s
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 94.22 MiB/s 144.61 MiB/s 176.84 MiB/s (55.51 GiB)
(02,01)        | 108.26 MiB/s 126.38 MiB/s 144.91 MiB/s (55.51 GiB)
```
pentschev commented 2 years ago

The reason for this behavior is compression. Dask 2022.3.0 (RAPIDS 22.04) depends on lz4, whereas Dask 2022.1.0 (RAPIDS 22.02) doesn't.
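
One way to confirm which compressor Distributed ends up with is to inspect its compression registry. Note this pokes at an internal module, so the attribute names below (correct for Distributed around this time) may change between versions:

```python
# distributed.protocol.compression is internal API, not a stable interface.
from distributed.protocol import compression

print(list(compression.compressions))   # registered compressors, e.g. [None, False, 'lz4', ...]
print(compression.default_compression)  # what "auto" resolves to; 'lz4' once lz4 is importable
```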

Distributed defaults to distributed.comm.compression=auto, which ends up picking lz4 when available. Disabling compression entirely yields significantly better bandwidth (~5x) and cuts total runtime by ~10x.

RAPIDS 22.04 (no compression)

```
$ DASK_DISTRIBUTED__COMM__COMPRESSION=None python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-21 04:35:28,295 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-21 04:35:28,298 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
3.96 s         | 1.51 GiB/s
4.04 s         | 1.47 GiB/s
3.95 s         | 1.51 GiB/s
================================================================================
Throughput     | 1.50 GiB +/- 16.61 MiB
Wall-Clock     | 3.98 s +/- 43.50 ms
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 655.04 MiB/s 707.73 MiB/s 717.14 MiB/s (10.62 GiB)
(02,01)        | 700.51 MiB/s 778.48 MiB/s 819.70 MiB/s (10.62 GiB)
```
RAPIDS 22.04 (lz4)

```
$ DASK_DISTRIBUTED__COMM__COMPRESSION=lz4 python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-21 04:22:57,556 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-21 04:22:57,558 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
41.16 s        | 148.28 MiB/s
44.95 s        | 135.77 MiB/s
45.06 s        | 135.44 MiB/s
================================================================================
Throughput     | 139.83 MiB +/- 5.97 MiB
Wall-Clock     | 43.73 s +/- 1.81 s
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 124.73 MiB/s 151.97 MiB/s 169.86 MiB/s (17.32 GiB)
(02,01)        | 115.92 MiB/s 132.03 MiB/s 144.96 MiB/s (17.32 GiB)
```
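
For reference, the DASK_DISTRIBUTED__COMM__COMPRESSION=None environment variable above maps to the regular Dask config key, so the same workaround can be applied programmatically before the cluster starts (a minimal sketch):

```python
import dask

# Equivalent to DASK_DISTRIBUTED__COMM__COMPRESSION=None: tell
# Distributed not to compress frames on the wire.
dask.config.set({"distributed.comm.compression": None})
```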

@quasiben @jakirkham do you have any ideas or suggestions on the best way to handle this? It feels to me like Dask-CUDA/Dask-cuDF should disable compression by default or find a suitable alternative to the CPU compression algorithms that are available by default.

madsbk commented 2 years ago

Good catch @pentschev!

It feels to me like Dask-CUDA/Dask-cuDF should disable compression by default or find a suitable alternative to the CPU compression algorithms that are available by default.

I agree; we should disable compression by default for now. If we want to make compression available, we could use KvikIO's Python bindings to nvCOMP.
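
(A rough sketch of what that might look like; the `LZ4Manager` name and the `compress`/`decompress` signatures over device arrays are assumptions based on KvikIO's nvCOMP binding docs, not a tested integration.)

```python
# Sketch only: GPU-side LZ4 compression via KvikIO's nvCOMP bindings.
# API names here (LZ4Manager, compress, decompress) are assumptions.
import cupy as cp
import kvikio.nvcomp

# A device buffer standing in for a serialized cuDF partition.
data = cp.random.randint(0, 256, size=100_000_000, dtype=cp.uint8)

manager = kvikio.nvcomp.LZ4Manager(dtype=cp.uint8)
compressed = manager.compress(data)        # compression happens on the GPU
roundtrip = manager.decompress(compressed)
```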

pentschev commented 2 years ago

That is a good idea @madsbk. Is this something we plan to add to Distributed? It would be good to do that and do some testing/profiling.
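
As a rough illustration of the CPU cost involved, one can time lz4 on a frame of hard-to-compress bytes (this uses the standard lz4.frame API, not necessarily Distributed's exact code path; numbers will vary by machine):

```python
import time

import lz4.frame
import numpy as np

# Random bytes compress poorly, mimicking already-dense GPU dataframe
# partitions that gain little from wire compression.
frame = np.random.bytes(2**28)  # 256 MiB

start = time.perf_counter()
compressed = lz4.frame.compress(frame)
elapsed = time.perf_counter() - start
print(f"compress: {elapsed:.2f} s, ratio: {len(compressed) / len(frame):.2f}")
```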

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

wence- commented 2 years ago

A short-term fix disabling compression is in #957.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.