rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Increase the default thread count for kvikIO file reads #16718

Closed: GregoryKimball closed this issue 3 weeks ago

GregoryKimball commented 2 months ago

Is your feature request related to a problem? Please describe.

For hot-cache files, we can increase the IO throughput with a multi-threaded kvikIO read.

Describe the solution you'd like

We should increase the number of threads used for kvikIO file reads in libcudf. You can see this effect when using kvikIO on a hot-cache file data source (which is a pageable host buffer pretending to be a file): with 8 threads we reach 80-90% utilization and about 50 GB/s of throughput. I suggest trying a behavior where we use up to 8 threads depending on the size of the file and the default task size. For example, with a task size of 4 MiB, we might use 1 thread for 0-8 MiB, 2 threads for 8-16 MiB, etc., up to 8 threads.
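A minimal sketch of that heuristic, assuming a hypothetical helper; the function name and the cap are illustrative, not an actual libcudf/kvikIO API:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper for the heuristic described above: one thread per
// 2x-task-size bucket of file size, capped at 8. With the 4 MiB default
// task size this gives 1 thread for 0-8 MiB, 2 for 8-16 MiB, and so on.
std::size_t threads_for_read(std::size_t file_size_bytes,
                             std::size_t task_size_bytes = 4 << 20)  // 4 MiB
{
  std::size_t const bucket = 2 * task_size_bytes;
  std::size_t const n      = (file_size_bytes + bucket - 1) / bucket;  // ceil-divide
  return std::clamp<std::size_t>(n, 1, 8);
}
```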

We may need to add new plumbing to let us change the kvikIO thread count per read operation.
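For context, kvikIO exposes the thread count as process-global state (the KVIKIO_NTHREADS environment variable used in the benchmarks below, and, if I recall the C++ API correctly, kvikio::defaults::thread_pool_nthreads_reset). A hedged sketch of a temporary override with that global API; treat the exact names and headers as assumptions:

```cpp
#include <kvikio/defaults.hpp>
#include <kvikio/file_handle.hpp>

// Sketch (API names are my recollection, not verified): temporarily raise
// the global kvikIO thread pool for one large read, then restore it.
// Because the setting is process-wide, a true per-read knob would need
// the new plumbing mentioned above.
void read_with_n_threads(kvikio::FileHandle& fh, void* dev_buf,
                         std::size_t nbytes, unsigned int nthreads)
{
  auto const previous = kvikio::defaults::thread_pool_nthreads();
  kvikio::defaults::thread_pool_nthreads_reset(nthreads);
  fh.pread(dev_buf, nbytes).get();  // parallel read into device memory
  kvikio::defaults::thread_pool_nthreads_reset(previous);
}
```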


Additional context

A multi-threaded copy would be great for single-threaded tools like cudf.pandas and could give 3-4x faster IO operations.

abellina commented 2 months ago

I believe a staged copy strategy with a pinned bounce buffer will be helpful from the standpoint of pinned memory management. We see a similar pattern in D2H copies for shuffle, where today we have to hold on to pinned memory while we write to the file system, which is often CPU bound due to compression, if not disk bound.

I would hope an MT copy strategy could be general, so we can use it for both H2D and D2H, perhaps with independent pinned bounce buffers.
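To make the idea concrete, here is a minimal single-threaded sketch of a staged copy through a pinned bounce buffer, using plain CUDA runtime calls (not cuDF/kvikIO code):

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <cstring>

// Sketch of a staged H2D copy through a pinned bounce buffer: copy a
// pageable host buffer to the device in chunks, memcpy'ing each chunk
// into pinned memory and then issuing an async H2D copy on a stream.
// Error handling is omitted for brevity.
void staged_h2d_copy(void* dst_device, void const* src_pageable,
                     std::size_t nbytes, cudaStream_t stream,
                     std::size_t chunk = 4 << 20 /* 4 MiB */)
{
  void* bounce = nullptr;
  cudaMallocHost(&bounce, chunk);  // pinned bounce buffer
  auto const* src = static_cast<char const*>(src_pageable);
  auto* dst       = static_cast<char*>(dst_device);
  for (std::size_t off = 0; off < nbytes; off += chunk) {
    std::size_t const n = std::min(chunk, nbytes - off);
    std::memcpy(bounce, src + off, n);  // pageable -> pinned (on the CPU)
    cudaMemcpyAsync(dst + off, bounce, n, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);  // must finish before reusing the single bounce buffer
  }
  cudaFreeHost(bounce);
}
```

Synchronizing on every chunk forfeits overlap; a real implementation would double-buffer or, as suggested above, run several such loops in parallel with independent pinned buffers for H2D and D2H.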

GregoryKimball commented 2 months ago

Thank you @abellina for this feedback. @kingcrimsontianyu and I did some benchmarking and found good results from increasing the thread count.

./PARQUET_READER_NVBENCH  -d 0 -b 1 --timeout 1 -a cardinality=0 -a run_length=1
KVIKIO_NTHREADS=8 ./PARQUET_READER_NVBENCH  -d 0 -b 1 --timeout 1 -a cardinality=0 -a run_length=1

For this benchmark with the FILEPATH data source, going from 1 to 8 threads on x86-H100 reduced the read time from 141 ms to 74 ms (about 1.9x). On GH200 it went from 100 ms to 67 ms (about 1.5x).

[I am opening an issue in kvikIO about MT memcpy and will update here]

GregoryKimball commented 1 month ago

@ayushdg noted that KVIKIO_NTHREADS could also impact performance on other file-like data sources, such as network-attached storage, Lustre, Slurm environments, and others.

sperlingxx commented 1 month ago

Regarding the MT D2H/H2D memory copy, are there any callable APIs for real-world applications like spark-rapids?

ayushdg commented 3 weeks ago

Following up on this: based on internal testing, setting KVIKIO_NTHREADS=8 negatively impacts performance on high-performance network filesystems like Lustre.