rapidsai / dask-cuda

Utilities for Dask and CUDA interactions
https://docs.rapids.ai/api/dask-cuda/stable/
Apache License 2.0

[Benchmark] Add parquet read benchmark #1371

Closed rjzamora closed 1 month ago

rjzamora commented 2 months ago

Adds a new benchmark for parquet read performance using a LocalCUDACluster. The user can pass in the --key and --secret options to specify S3 credentials.

E.g.

$ python ./local_read_parquet.py --devs 0,1,2,3,4,5,6,7 --filesystem fsspec --type gpu --file-count 48 --aggregate-files

Parquet read benchmark
--------------------------------------------------------------------------------
Path                      | s3://dask-cudf-parquet-testing/dedup_parquet
Columns                   | None
Backend                   | cudf
Filesystem                | fsspec
Blocksize                 | 244.14 MiB
Aggregate files           | True
Row count                 | 372066
Size on disk              | 1.03 GiB
Number of workers         | 8
================================================================================
Wall clock                | Throughput
--------------------------------------------------------------------------------
36.75 s                   | 28.78 MiB/s
21.29 s                   | 49.67 MiB/s
17.91 s                   | 59.05 MiB/s
================================================================================
Throughput                | 41.77 MiB/s +/- 7.81 MiB/s
Bandwidth                 | 0 B/s +/- 0 B/s
Wall clock                | 25.32 s +/- 8.20 s
================================================================================
...
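As a side note, the --key and --secret options described above presumably get forwarded to fsspec/s3fs via Dask's storage_options mechanism. A minimal sketch of that mapping (the helper name is hypothetical, not part of the benchmark):

```python
# Hypothetical helper: build the storage_options dict that s3fs/fsspec
# accepts (the "key"/"secret" fields are the standard s3fs credential names).
def make_storage_options(key=None, secret=None):
    """Return a storage_options dict for read_parquet, or None if no creds."""
    opts = {}
    if key is not None:
        opts["key"] = key
    if secret is not None:
        opts["secret"] = secret
    return opts or None

# e.g. dd.read_parquet(path, storage_options=make_storage_options(key, secret))
print(make_storage_options("AKIA-EXAMPLE", "example-secret"))
```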

Notes:

pentschev commented 2 months ago

Performance generally scales with the number of workers (multiplied by the number of threads per worker)

I'm assuming this applies to CPU-only operations, or are there CUDA kernels executed as part of this as well?

rjzamora commented 2 months ago

I'm assuming this applies to CPU-only operations, or are there CUDA kernels executed as part of this as well?

This benchmark is entirely IO/CPU bound. There is effectively no CUDA compute: we are just transferring remote data into host memory and then moving it into device memory (when the default --type gpu is used). Therefore, increasing threads_per_worker * n_workers typically improves performance, because more threads are making connections and sending requests to S3.
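The scaling behavior described above can be illustrated with a toy sketch (not from the PR): each simulated "S3 request" just blocks for a fixed time, so wall-clock time shrinks roughly linearly as more threads issue requests concurrently, exactly as with threads_per_worker * n_workers in the real benchmark.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_fragment(_):
    """Stand-in for one blocking GET of a parquet fragment from S3."""
    time.sleep(0.05)  # latency-dominated IO; no CUDA compute happens here
    return 1

def timed_read(num_threads, num_fragments=16):
    """Fetch all fragments with the given thread count; return (count, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        done = sum(pool.map(fetch_fragment, range(num_fragments)))
    return done, time.perf_counter() - start

_, t1 = timed_read(1)   # serial: roughly 16 * 0.05 s
_, t8 = timed_read(8)   # 8 concurrent requests: roughly 2 * 0.05 s
print(f"1 thread: {t1:.2f}s, 8 threads: {t8:.2f}s")
```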

rjzamora commented 1 month ago

Update: I've generalized this benchmark. It's easy to use with S3 storage, but it is also useful for benchmarking local-storage performance.

rjzamora commented 1 month ago

/merge