rapidsai / dask-cuda

Utilities for Dask and CUDA interactions
https://docs.rapids.ai/api/dask-cuda/stable/
Apache License 2.0

Performance regression in cuDF merge benchmark #935

Open pentschev opened 2 years ago

pentschev commented 2 years ago

Running the cuDF benchmark with RAPIDS 22.06 results in the following:

RAPIDS 22.06 cuDF benchmark

```
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-16 08:21:54,375 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-16 08:21:54,382 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
20.70 s        | 294.80 MiB/s
17.62 s        | 346.49 MiB/s
39.32 s        | 155.22 MiB/s
================================================================================
Throughput     | 265.50 MiB +/- 80.79 MiB
Wall-Clock     | 25.88 s +/- 9.59 s
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 110.55 MiB/s 153.32 MiB/s 187.99 MiB/s (12.85 GiB)
(02,01)        | 147.30 MiB/s 173.17 MiB/s 187.13 MiB/s (12.85 GiB)
```

Rolling back one year to RAPIDS 21.06, performance was substantially better:

RAPIDS 21.06 cuDF benchmark

```
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock     | Throughput
-------------------------------
15.40 s        | 396.40 MiB/s
7.35 s         | 830.55 MiB/s
8.80 s         | 693.83 MiB/s
===============================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 325.82 MiB/s 332.85 MiB/s 351.81 MiB/s (12.85 GiB)
(02,01)        | 296.46 MiB/s 321.66 MiB/s 333.66 MiB/s (12.85 GiB)
```

It isn't clear where this regression comes from; potential candidates include Distributed, cuDF, and Dask-CUDA itself.

pentschev commented 2 years ago

RAPIDS 21.12 and 22.02 perform better than 21.06; the regression first appears in 22.04. See results below.

RAPIDS 21.06 cuDF benchmark - 10 iterations

```
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock     | Throughput
-------------------------------
8.28 s         | 737.58 MiB/s
15.94 s        | 382.80 MiB/s
16.27 s        | 375.17 MiB/s
15.90 s        | 383.76 MiB/s
15.54 s        | 392.86 MiB/s
15.67 s        | 389.50 MiB/s
15.52 s        | 393.30 MiB/s
16.02 s        | 381.04 MiB/s
7.72 s         | 790.71 MiB/s
8.35 s         | 730.57 MiB/s
===============================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 291.99 MiB/s 317.62 MiB/s 402.70 MiB/s (51.04 GiB)
(02,01)        | 295.38 MiB/s 327.51 MiB/s 401.62 MiB/s (51.03 GiB)
```
RAPIDS 21.12 cuDF benchmark - 10 iterations

```
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock     | Throughput
-------------------------------
5.91 s         | 1.01 GiB/s
5.72 s         | 1.04 GiB/s
11.16 s        | 546.74 MiB/s
4.82 s         | 1.24 GiB/s
4.87 s         | 1.22 GiB/s
4.83 s         | 1.23 GiB/s
5.72 s         | 1.04 GiB/s
5.78 s         | 1.03 GiB/s
5.76 s         | 1.03 GiB/s
11.18 s        | 546.06 MiB/s
===============================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 429.34 MiB/s 509.66 MiB/s 626.34 MiB/s (39.86 GiB)
(02,01)        | 419.16 MiB/s 502.99 MiB/s 633.16 MiB/s (39.86 GiB)
```
RAPIDS 22.02 cuDF benchmark - 10 iterations

```
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
4.79 s         | 1.24 GiB/s
10.99 s        | 555.41 MiB/s
10.05 s        | 607.36 MiB/s
10.29 s        | 593.14 MiB/s
9.94 s         | 614.12 MiB/s
10.37 s        | 588.66 MiB/s
4.78 s         | 1.25 GiB/s
5.71 s         | 1.04 GiB/s
10.13 s        | 602.58 MiB/s
4.69 s         | 1.27 GiB/s
================================================================================
Throughput     | 848.05 MiB +/- 317.55 MiB
Wall-Clock     | 8.17 s +/- 2.62 s
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 428.54 MiB/s 478.27 MiB/s 562.20 MiB/s (48.80 GiB)
(02,01)        | 440.11 MiB/s 513.94 MiB/s 562.02 MiB/s (48.80 GiB)
```
RAPIDS 22.04 cuDF benchmark - 10 iterations

```
$ python local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
2022-06-20 02:01:12,323 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-20 02:01:12,325 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
49.11 s        | 124.29 MiB/s
48.77 s        | 125.15 MiB/s
45.11 s        | 135.31 MiB/s
44.94 s        | 135.81 MiB/s
44.13 s        | 138.30 MiB/s
40.67 s        | 150.07 MiB/s
48.73 s        | 125.25 MiB/s
15.46 s        | 394.67 MiB/s
48.26 s        | 126.47 MiB/s
44.82 s        | 136.17 MiB/s
================================================================================
Throughput     | 159.15 MiB +/- 78.88 MiB
Wall-Clock     | 43.00 s +/- 9.53 s
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 94.22 MiB/s 144.61 MiB/s 176.84 MiB/s (55.51 GiB)
(02,01)        | 108.26 MiB/s 126.38 MiB/s 144.91 MiB/s (55.51 GiB)
```
pentschev commented 2 years ago

The reason for this behavior is compression. Dask 2022.3.0 (RAPIDS 22.04) depends on lz4, whereas Dask 2022.1.0 (RAPIDS 22.02) doesn't.
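
One way to confirm which compressor Distributed ends up with is to inspect its compression registry. Note this pokes at an internal module, so the attribute names below (correct for Distributed around this time) may change between versions:

```python
# distributed.protocol.compression is internal API, not a stable interface.
from distributed.protocol import compression

print(list(compression.compressions))   # registered compressors, e.g. [None, False, 'lz4', ...]
print(compression.default_compression)  # what "auto" resolves to; 'lz4' once lz4 is importable
```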

Distributed defaults to distributed.comm.compression=auto, which ends up picking lz4 when available. Disabling compression entirely yields significantly better bandwidth (~5x) and cuts total runtime by ~10x.

RAPIDS 22.04 (no compression)

```
$ DASK_DISTRIBUTED__COMM__COMPRESSION=None python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-21 04:35:28,295 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-21 04:35:28,298 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
3.96 s         | 1.51 GiB/s
4.04 s         | 1.47 GiB/s
3.95 s         | 1.51 GiB/s
================================================================================
Throughput     | 1.50 GiB +/- 16.61 MiB
Wall-Clock     | 3.98 s +/- 43.50 ms
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 655.04 MiB/s 707.73 MiB/s 717.14 MiB/s (10.62 GiB)
(02,01)        | 700.51 MiB/s 778.48 MiB/s 819.70 MiB/s (10.62 GiB)
```
RAPIDS 22.04 (lz4)

```
$ DASK_DISTRIBUTED__COMM__COMPRESSION=lz4 python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-21 04:22:57,556 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-21 04:22:57,558 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
41.16 s        | 148.28 MiB/s
44.95 s        | 135.77 MiB/s
45.06 s        | 135.44 MiB/s
================================================================================
Throughput     | 139.83 MiB +/- 5.97 MiB
Wall-Clock     | 43.73 s +/- 1.81 s
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 124.73 MiB/s 151.97 MiB/s 169.86 MiB/s (17.32 GiB)
(02,01)        | 115.92 MiB/s 132.03 MiB/s 144.96 MiB/s (17.32 GiB)
```
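
For reference, the DASK_DISTRIBUTED__COMM__COMPRESSION=None environment variable above maps to the regular Dask config key, so the same workaround can be applied programmatically before the cluster starts (a minimal sketch):

```python
import dask

# Equivalent to DASK_DISTRIBUTED__COMM__COMPRESSION=None: tell
# Distributed not to compress frames on the wire.
dask.config.set({"distributed.comm.compression": None})
```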

@quasiben @jakirkham do you have any ideas or suggestions on the best way to handle this? It feels to me like Dask-CUDA/Dask-cuDF should disable compression by default or find a suitable alternative to the CPU compression algorithms that are available by default.

madsbk commented 2 years ago

Good catch @pentschev!

It feels to me like Dask-CUDA/Dask-cuDF should disable compression by default or find a suitable alternative to the CPU compression algorithms that are available by default.

I agree; we should disable compression by default for now. If we want to make compression available, we could use KvikIO's Python bindings to nvCOMP.
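
(A rough sketch of what that might look like; the `LZ4Manager` name and the `compress`/`decompress` signatures over device arrays are assumptions based on KvikIO's nvCOMP binding docs, not a tested integration.)

```python
# Sketch only: GPU-side LZ4 compression via KvikIO's nvCOMP bindings.
# API names here (LZ4Manager, compress, decompress) are assumptions.
import cupy as cp
import kvikio.nvcomp

# A device buffer standing in for a serialized cuDF partition.
data = cp.random.randint(0, 256, size=100_000_000, dtype=cp.uint8)

manager = kvikio.nvcomp.LZ4Manager(dtype=cp.uint8)
compressed = manager.compress(data)        # compression happens on the GPU
roundtrip = manager.decompress(compressed)
```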

pentschev commented 2 years ago

That is a good idea @madsbk. Is this something we plan to add to Distributed? It would be good to do that and do some testing/profiling.
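
As a rough illustration of the CPU cost involved, one can time lz4 on a frame of hard-to-compress bytes (this uses the standard lz4.frame API, not necessarily Distributed's exact code path; numbers will vary by machine):

```python
import time

import lz4.frame
import numpy as np

# Random bytes compress poorly, mimicking already-dense GPU dataframe
# partitions that gain little from wire compression.
frame = np.random.bytes(2**28)  # 256 MiB

start = time.perf_counter()
compressed = lz4.frame.compress(frame)
elapsed = time.perf_counter() - start
print(f"compress: {elapsed:.2f} s, ratio: {len(compressed) / len(frame):.2f}")
```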

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

wence- commented 2 years ago

A short-term fix disabling compression is in #957.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.