rapidsai / dask-cuda

Utilities for Dask and CUDA interactions
https://docs.rapids.ai/api/dask-cuda/stable/
Apache License 2.0

[Benchmark] Add parquet read benchmark #1371

Closed rjzamora closed 1 month ago

rjzamora commented 2 months ago

Adds a new benchmark for parquet read performance using a LocalCUDACluster. The user can pass in the --key and --secret options to specify S3 credentials.

E.g.

$ python ./local_read_parquet.py --devs 0,1,2,3,4,5,6,7 --filesystem fsspec --type gpu --file-count 48 --aggregate-files

Parquet read benchmark
--------------------------------------------------------------------------------
Path                      | s3://dask-cudf-parquet-testing/dedup_parquet
Columns                   | None
Backend                   | cudf
Filesystem                | fsspec
Blocksize                 | 244.14 MiB
Aggregate files           | True
Row count                 | 372066
Size on disk              | 1.03 GiB
Number of workers         | 8
================================================================================
Wall clock                | Throughput
--------------------------------------------------------------------------------
36.75 s                   | 28.78 MiB/s
21.29 s                   | 49.67 MiB/s
17.91 s                   | 59.05 MiB/s
================================================================================
Throughput                | 41.77 MiB/s +/- 7.81 MiB/s
Bandwidth                 | 0 B/s +/- 0 B/s
Wall clock                | 25.32 s +/- 8.20 s
================================================================================
...
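As a side note, the --key and --secret options described above presumably get forwarded to fsspec/s3fs via Dask's storage_options mechanism. A minimal sketch of that mapping (the helper name is hypothetical, not part of the benchmark):

```python
# Hypothetical helper: build the storage_options dict that s3fs/fsspec
# accepts (the "key"/"secret" fields are the standard s3fs credential names).
def make_storage_options(key=None, secret=None):
    """Return a storage_options dict for read_parquet, or None if no creds."""
    opts = {}
    if key is not None:
        opts["key"] = key
    if secret is not None:
        opts["secret"] = secret
    return opts or None

# e.g. dd.read_parquet(path, storage_options=make_storage_options(key, secret))
print(make_storage_options("AKIA-EXAMPLE", "example-secret"))
```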

Notes:

pentschev commented 2 months ago

Performance generally scales with the number of workers (multiplied by the number of threads per worker)

I'm assuming this applies to CPU-only operations, or are there CUDA kernels executed as part of this as well?

rjzamora commented 2 months ago

I'm assuming this applies to CPU-only operations, or are there CUDA kernels executed as part of this as well?

This benchmark is entirely IO/CPU bound. There is effectively no CUDA compute: we are just transferring remote data into host memory and then moving it into device memory (when the default --type gpu is used). Therefore, increasing threads_per_worker * n_workers typically improves performance, because more threads are making connections and sending requests to S3.
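The scaling behavior described above can be illustrated with a toy sketch (not from the PR): each simulated "S3 request" just blocks for a fixed time, so wall-clock time shrinks roughly linearly as more threads issue requests concurrently, exactly as with threads_per_worker * n_workers in the real benchmark.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_fragment(_):
    """Stand-in for one blocking GET of a parquet fragment from S3."""
    time.sleep(0.05)  # latency-dominated IO; no CUDA compute happens here
    return 1

def timed_read(num_threads, num_fragments=16):
    """Fetch all fragments with the given thread count; return (count, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        done = sum(pool.map(fetch_fragment, range(num_fragments)))
    return done, time.perf_counter() - start

_, t1 = timed_read(1)   # serial: roughly 16 * 0.05 s
_, t8 = timed_read(8)   # 8 concurrent requests: roughly 2 * 0.05 s
print(f"1 thread: {t1:.2f}s, 8 threads: {t8:.2f}s")
```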

rjzamora commented 1 month ago

Update: I've generalized this benchmark. It's easy to use with S3 storage, but it is also useful for benchmarking local-storage performance.

rjzamora commented 1 month ago

/merge