GregoryKimball closed this issue 3 weeks ago
I believe that a staged copy strategy with a pinned bounce buffer will be helpful from the standpoint of pinned memory management. We see a similar pattern on the D2H side for shuffle, where today we have to hold on to pinned memory while we write to the file system, which is often CPU bound due to compression, if not disk bound.
I would hope an MT copy strategy could be general, so we can use it for both H2D and D2H, perhaps with independent pinned bounce buffers.
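For illustration only, here is a minimal sketch of what a staged H2D copy through a pinned bounce buffer might look like. The function name, 4 MiB chunk size, and single-stream synchronous reuse of the bounce buffer are assumptions for this sketch, not kvikIO or spark-rapids code; a real MT implementation would overlap the pageable-to-pinned memcpy and the DMA across threads or double buffers.

```cpp
// Sketch: staged H2D copy through a pinned bounce buffer (illustrative only).
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>
#include <cstring>

void staged_h2d_copy(void* device_dst, void const* pageable_src, std::size_t nbytes,
                     cudaStream_t stream)
{
  constexpr std::size_t bounce_size = 4 << 20;  // 4 MiB bounce buffer (assumed size)
  void* bounce = nullptr;
  cudaHostAlloc(&bounce, bounce_size, cudaHostAllocDefault);  // pinned allocation

  auto const* src = static_cast<char const*>(pageable_src);
  auto* dst       = static_cast<char*>(device_dst);
  for (std::size_t offset = 0; offset < nbytes; offset += bounce_size) {
    std::size_t const chunk = std::min(bounce_size, nbytes - offset);
    std::memcpy(bounce, src + offset, chunk);          // pageable -> pinned (CPU)
    cudaMemcpyAsync(dst + offset, bounce, chunk,
                    cudaMemcpyHostToDevice, stream);   // pinned -> device (DMA)
    // The single bounce buffer is reused, so wait before refilling it.
    cudaStreamSynchronize(stream);
  }
  cudaFreeHost(bounce);
}
```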
Thank you @abellina for this feedback. @kingcrimsontianyu and I did some benchmarking and found good results from increasing the thread count.
./PARQUET_READER_NVBENCH -d 0 -b 1 --timeout 1 -a cardinality=0 -a run_length=1
KVIKIO_NTHREADS=8 ./PARQUET_READER_NVBENCH -d 0 -b 1 --timeout 1 -a cardinality=0 -a run_length=1
For this benchmark with the FILEPATH datasource, going from 1 to 8 threads on x86-H100 reduced the time from 141 ms to 74 ms. On GH200 the difference was 100 ms to 67 ms.
[I am opening an issue in kvikIO about MT memcpy and will update here]
@ayushdg noted that KVIKIO_NTHREADS could also impact performance on other file-like data sources such as network-attached storage, Lustre, Slurm, and others.
In terms of MT D2H/H2D memory copy, are there any callable APIs for real-world applications like spark-rapids?
Following up on this: based on internal testing, setting KVIKIO_NTHREADS=8 negatively impacts performance on high-performance network filesystems like Lustre.
Is your feature request related to a problem? Please describe.
For hot-cache files, we can increase the IO throughput with a multi-threaded kvikIO read.
Describe the solution you'd like
We should increase the number of threads kvikIO uses for file reads in libcudf. You can see this effect when using kvikIO on a hot-cache file data source (which is a pageable host buffer pretending to be a file), where with 8 threads we reach 80-90% utilization and around 50 GB/s of throughput. I suggest a behavior where we use up to 8 threads depending on the size of the file and the default task size. For example, with a task size of 4 MiB, we might use 1 thread for 0-8 MiB, two threads for 8-16 MiB, and so on up to 8 threads.
We may need to add new plumbing to let us change the kvikIO thread count per read operation.
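A rough sketch of the proposed sizing heuristic follows; the helper name and constants are assumptions for illustration, not an existing libcudf or kvikIO API.

```cpp
// Hypothetical helper illustrating the proposed heuristic: scale the thread
// count with file size using the default 4 MiB task size, capped at 8 threads.
#include <algorithm>
#include <cstddef>

unsigned int suggested_nthreads(std::size_t file_size_bytes)
{
  constexpr std::size_t task_size   = 4 << 20;  // 4 MiB default task size (assumption)
  constexpr unsigned int max_threads = 8;
  // One thread per two tasks: 0-8 MiB -> 1 thread, 8-16 MiB -> 2 threads, ...
  std::size_t const n = (file_size_bytes + 2 * task_size - 1) / (2 * task_size);
  return static_cast<unsigned int>(std::clamp<std::size_t>(n, 1, max_threads));
}
```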
Additional context
A multi-threaded copy would be great for single-threaded tools like cudf.pandas and could give 3-4x faster IO operations.