Closed rjzamora closed 1 month ago
Performance generally scales with the number of workers (multiplied by the number of threads per worker).
I'm assuming this applies to CPU-only operations, or are there CUDA kernels executed as part of this as well?
This benchmark is entirely IO/CPU bound. There is effectively no CUDA compute: we are just transferring remote data into host memory and moving it into device memory (when the default `--type gpu` is used). Therefore, increasing `threads_per_worker * n_workers` typically improves performance, because we have more threads making connections and sending requests to S3.
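The IO-bound scaling argument above can be illustrated with a standalone sketch. This uses a plain `ThreadPoolExecutor` with a simulated request (a sleep standing in for an S3 GET), not the benchmark code itself:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(_):
    # Stand-in for one S3 GET request: pure IO wait, no CPU work.
    time.sleep(0.05)

def run(n_threads, n_requests=16):
    # Issue n_requests "downloads" across n_threads concurrent threads
    # and return the total wall-clock time.
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as ex:
        list(ex.map(fetch, range(n_requests)))
    return time.perf_counter() - t0

serial = run(1)
parallel = run(8)
# With IO-bound work, more threads means more in-flight requests,
# so the parallel run finishes much faster than the serial one.
```

The same reasoning is why raising `threads_per_worker * n_workers` helps here: the cluster can keep more requests in flight at once.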
Update: I've generalized this benchmark. It's easy to use with S3 storage, but it is also useful for measuring local-storage performance.
/merge
Adds a new benchmark for Parquet read performance using a `LocalCUDACluster`. The user can pass in `--key` and `--secret` options to specify S3 credentials. E.g.
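A hypothetical invocation might look like the following (the script name and environment variables are illustrative only; they are not taken from this PR):

```shell
# Hypothetical script name and credential sources, shown only to
# illustrate how --key and --secret could be passed.
python local_read_parquet.py \
    --key "$AWS_ACCESS_KEY_ID" \
    --secret "$AWS_SECRET_ACCESS_KEY"
```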
Notes:
- `--filesystem arrow` together with `--type gpu` performs well, but depends on https://github.com/rapidsai/cudf/pull/16684