pentschev opened 2 years ago
RAPIDS 21.12 and 22.02 perform better than 21.06; the regression first appeared in 22.04, see results below.
The reason for this behavior is compression: Dask 2022.3.0 (RAPIDS 22.04) depends on `lz4`, whereas Dask 2022.1.0 (RAPIDS 22.02) doesn't.
Distributed defaults to `distributed.comm.compression=auto`, which ends up picking `lz4` when it is available. Disabling compression entirely yields significantly better bandwidth (~5x) and severely reduces total runtime (~10x).
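For reference, a minimal sketch of disabling compression before the cluster is started, using the same `distributed.comm.compression` key (workers created afterwards inherit the setting):

```python
import dask
from distributed import Client
from dask_cuda import LocalCUDACluster

# Override Distributed's "auto" default, which picks lz4 whenever the package
# is importable, before any workers are started so they inherit the setting.
dask.config.set({"distributed.comm.compression": None})

cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="1,2")
client = Client(cluster)
```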
@quasiben @jakirkham do you have any ideas or suggestions on the best way to handle this? It feels to me like Dask-CUDA/Dask-cuDF should disable compression by default or find a suitable alternative to the CPU compression algorithms that are available by default.
Good catch @pentschev!
> It feels to me like Dask-CUDA/Dask-cuDF should disable compression by default or find a suitable alternative to the CPU compression algorithms that are available by default.
I agree, we should disable compression by default for now. If we want to make compression available, we could use KvikIO's Python bindings for nvCOMP.
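If we go that route, the wiring could look roughly like the sketch below. This is illustrative only: `compressions` in `distributed.protocol.compression` is an internal structure rather than a stable public API, and `nvcomp_compress`/`nvcomp_decompress` are placeholders for whatever KvikIO's nvCOMP bindings end up providing.

```python
import dask
from distributed.protocol import compression

# Placeholder wrappers around KvikIO's nvCOMP bindings; real implementations
# would need to handle host<->device transfers and buffer framing.
def nvcomp_compress(data):
    raise NotImplementedError("wrap KvikIO's nvCOMP bindings here")

def nvcomp_decompress(data):
    raise NotImplementedError("wrap KvikIO's nvCOMP bindings here")

# Register the codec the same way Distributed registers lz4/zstd internally
# (note: this dict is internal and subject to change between releases).
compression.compressions["nvcomp"] = {
    "compress": nvcomp_compress,
    "decompress": nvcomp_decompress,
}

# Select it explicitly instead of relying on "auto".
dask.config.set({"distributed.comm.compression": "nvcomp"})
```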
That is a good idea @madsbk, is this something we plan to add to Distributed? It would be good to do that and run some testing/profiling.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
Short-term fix disabling compression is in #957.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
Running the cuDF benchmark with RAPIDS 22.06 results in the following:
RAPIDS 22.06 cuDF benchmark
```
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-16 08:21:54,375 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-16 08:21:54,382 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
20.70 s        | 294.80 MiB/s
17.62 s        | 346.49 MiB/s
39.32 s        | 155.22 MiB/s
================================================================================
Throughput     | 265.50 MiB +/- 80.79 MiB
Wall-Clock     | 25.88 s +/- 9.59 s
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 110.55 MiB/s 153.32 MiB/s 187.99 MiB/s (12.85 GiB)
(02,01)        | 147.30 MiB/s 173.17 MiB/s 187.13 MiB/s (12.85 GiB)
```

If we roll back one year, to RAPIDS 21.06, performance was substantially superior:
RAPIDS 21.06 cuDF benchmark
```
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock     | Throughput
-------------------------------
15.40 s        | 396.40 MiB/s
7.35 s         | 830.55 MiB/s
8.80 s         | 693.83 MiB/s
===============================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 325.82 MiB/s 332.85 MiB/s 351.81 MiB/s (12.85 GiB)
(02,01)        | 296.46 MiB/s 321.66 MiB/s 333.66 MiB/s (12.85 GiB)
```

It isn't clear where this comes from, but potential candidates seem to be Distributed, cuDF, or Dask-CUDA itself.
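One way to narrow this down would be to repeat the 22.06 run with Distributed's compression disabled, e.g. by overriding the config through the environment (assuming Dask's standard environment-variable mapping of `distributed.comm.compression`):

```
$ DASK_DISTRIBUTED__COMM__COMPRESSION=None python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
```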