rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Use SMs to submit small copies to prevent serialization on a busy copy engine #15620

Open abellina opened 4 months ago

abellina commented 4 months ago

We have seen patterns where small cudaMemcpyAsync calls collide with large cudaMemcpyAsync calls being handled by the copy engine. Importantly, the small copy is in a different stream than the large copy. In the example below, a 51 KB pinned H2D copy was scheduled with a latency of 12 ms because another pinned H2D copy (~200 MB) was happening at the same time.

[screenshot 2024-04-30_10-35: profiler timeline showing the 51 KB pinned H2D copy delayed behind the ~200 MB copy]

The big issue behind this pattern is that Stream 30 in this case serializes: nothing else will run in that stream until the small copy is done. When we invoke kernels in cuDF, the usual pattern is: small H2Ds, followed by the kernel invocation, then small D2Hs. Any of the pre/post copies done around a kernel is a candidate to get stuck, serializing all the work in that stream (see the sketch below).
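To make that pattern concrete, here is a minimal illustrative sketch (not actual cuDF code; the kernel, buffers, and launch configuration are placeholders):

```cpp
// Illustrative only: the small-H2D / kernel / small-D2H pattern described above.
// If the first copy is queued behind a large transfer on the copy engine, the
// kernel and the D2H copy (and everything else on `stream`) wait with it.
#include <cstddef>
#include <cuda_runtime.h>

__global__ void some_kernel(int const* params, int* data) { /* ... placeholder work ... */ }

void launch_step(int const* h_params, int* d_params, int* d_data,
                 int* h_result, int const* d_result,
                 std::size_t small_bytes, cudaStream_t stream)
{
  cudaMemcpyAsync(d_params, h_params, small_bytes, cudaMemcpyHostToDevice, stream);  // small H2D
  some_kernel<<<1, 256, 0, stream>>>(d_params, d_data);                              // kernel
  cudaMemcpyAsync(h_result, d_result, small_bytes, cudaMemcpyDeviceToHost, stream);  // small D2H
}
```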

We have a PoC that uses thrust::copy_n to copy from pinned to device memory and vice versa using SMs instead of the copy engine, since kernels can directly touch pinned memory. With this approach, the small copy runs with much lower latency and subsequent work in the stream is unblocked. This leads to kernels running at the same time as large copies, which is a desirable pattern.
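A simplified sketch of this approach (not the actual PoC code; the helper name is illustrative):

```cpp
// Sketch only: copy from pinned host memory to device memory using SMs
// (a thrust::copy_n kernel) instead of the copy engine. This works because
// kernels can dereference pinned (page-locked) host memory directly.
#include <cstddef>
#include <cuda_runtime_api.h>
#include <thrust/copy.h>
#include <thrust/system/cuda/execution_policy.h>

void copy_pinned_to_device(void* device_dst, void const* pinned_src,
                           std::size_t num_bytes, cudaStream_t stream)
{
  auto src = static_cast<unsigned char const*>(pinned_src);
  auto dst = static_cast<unsigned char*>(device_dst);
  // par_nosync.on(stream) launches the copy kernel on `stream` without
  // blocking the calling thread.
  thrust::copy_n(thrust::cuda::par_nosync.on(stream), src, num_bytes, dst);
}
```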

This issue tracks that work and the question of how to bring these changes to cuDF in a configurable way so we don't affect serial workloads, since the effect is most prominent for parallel workloads such as Spark.

This is NDS q9 at 3 TB. Looking at the CUDA HW row, we can see that compute (blue) overlaps with pinned H2Ds (green) more often than not.

Before: [screenshot 2024-04-30_10-43]

After: [screenshot 2024-04-30_10-44]

bdice commented 4 months ago

Are there any ways in which CCCL's cuda::memcpy_async could help us here? https://nvidia.github.io/cccl/libcudacxx/extended_api/asynchronous_operations/memcpy_async.html

I'm not super familiar with the facilities libcudacxx provides for this, but perhaps this could be in scope.
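For context, a minimal illustrative sketch of the device-side cuda::memcpy_async pattern from the linked libcudacxx/CUDA docs (the kernel name, element type, and the trivial "+1" work are placeholders); note it copies into shared memory inside a kernel rather than replacing a host-issued cudaMemcpyAsync:

```cpp
// Illustrative sketch of a block-wide cuda::memcpy_async completed through a
// cuda::barrier, following the documented usage pattern.
#include <cstddef>
#include <cooperative_groups.h>
#include <cuda/barrier>

__global__ void stage_and_process(int const* global_in, int* global_out, std::size_t n)
{
  extern __shared__ int staged[];  // dynamic shared-memory staging buffer
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;

  auto block = cooperative_groups::this_thread_block();
  if (block.thread_rank() == 0) { init(&bar, block.size()); }
  block.sync();

  // All threads in the block cooperatively issue the asynchronous copy.
  cuda::memcpy_async(block, staged, global_in, sizeof(int) * n, bar);
  bar.arrive_and_wait();  // wait until the copied data is visible

  for (std::size_t i = block.thread_rank(); i < n; i += block.size()) {
    global_out[i] = staged[i] + 1;  // placeholder "work" on the staged data
  }
}
```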

abellina commented 4 months ago

@bdice I am not familiar with it either. A quick code search suggests that cuda::memcpy_async handles D2D copies (shared, global). I suppose we could treat pinned memory as another "device" memory type, so in that case I see some relation, but the main thing is that whatever code we use to orchestrate copies should likely fall back to cudaMemcpyAsync if pinned memory isn't available to bounce the copy through, or if the feature is disabled. I believe single-threaded applications (single stream) may want to use cudaMemcpyAsync as is, because there shouldn't be much contention on the copy engine in those cases.
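A hypothetical sketch of that orchestration/fallback logic (the function name, size threshold, and flags are made up for illustration):

```cpp
// Hypothetical dispatch: use an SM-driven copy for small pinned transfers,
// otherwise fall back to cudaMemcpyAsync and the copy engine.
#include <cstddef>
#include <cuda_runtime_api.h>
#include <thrust/copy.h>
#include <thrust/system/cuda/execution_policy.h>

constexpr std::size_t kernel_copy_threshold = 1 << 20;  // illustrative: copies up to 1 MiB

void copy_h2d(void* device_dst, void const* host_src, std::size_t num_bytes,
              cudaStream_t stream, bool src_is_pinned, bool kernel_copy_enabled)
{
  if (kernel_copy_enabled && src_is_pinned && num_bytes <= kernel_copy_threshold) {
    // Small pinned copy: let SMs do it so the stream doesn't wait on a busy copy engine.
    auto src = static_cast<unsigned char const*>(host_src);
    auto dst = static_cast<unsigned char*>(device_dst);
    thrust::copy_n(thrust::cuda::par_nosync.on(stream), src, num_bytes, dst);
  } else {
    // Pageable source, large copy, or feature disabled: use the copy engine.
    cudaMemcpyAsync(device_dst, host_src, num_bytes, cudaMemcpyHostToDevice, stream);
  }
}
```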

jrhemstad commented 4 months ago

I suspect many of the small copies in cuDF come from rmm::device_scalar. We could just update the implementation to use a kernel instead of cudaMemcpy*.
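A rough sketch of that idea (not rmm's actual implementation; the names are illustrative):

```cpp
// Illustrative only: set a device scalar with a single-thread kernel instead
// of cudaMemcpyAsync. The source must be dereferenceable from the device,
// e.g. pinned (page-locked) host memory.
#include <cuda_runtime.h>

template <typename T>
__global__ void set_scalar_kernel(T* device_dst, T const* pinned_src)
{
  *device_dst = *pinned_src;
}

template <typename T>
void set_scalar(T* device_dst, T const* pinned_src, cudaStream_t stream)
{
  set_scalar_kernel<T><<<1, 1, 0, stream>>>(device_dst, pinned_src);
}
```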

vuule commented 4 months ago

> I suspect many of the small copies in cuDF come from rmm::device_scalar. We could just update the implementation to use a kernel instead of cudaMemcpy*.

AFAIK we also need to get device_scalar to use pinned memory, which implies that we'd also need to pass a host memory resource to enable the use of a pinned pool.
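A hedged sketch of what that could look like, assuming a host memory resource is available to hand out pinned staging buffers (the `host_mr` parameter and the helper are hypothetical, not rmm's current device_scalar API):

```cpp
// Hypothetical: bounce a scalar through pinned memory obtained from a host
// memory resource (e.g. backed by a pinned pool) before the H2D transfer.
// The stream synchronization here is simplistic; a real implementation would
// recycle the staging buffer asynchronously.
#include <cuda_runtime_api.h>
#include <rmm/mr/host/pinned_memory_resource.hpp>

template <typename T>
void set_value_via_pinned_bounce(T* device_dst, T const& host_value,
                                 rmm::mr::host_memory_resource* host_mr,  // hypothetical parameter
                                 cudaStream_t stream)
{
  auto* staging = static_cast<T*>(host_mr->allocate(sizeof(T), alignof(T)));
  *staging = host_value;
  cudaMemcpyAsync(device_dst, staging, sizeof(T), cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);
  host_mr->deallocate(staging, sizeof(T), alignof(T));
}
```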

GregoryKimball commented 3 months ago

Thanks to the new benchmarks available in https://github.com/rapidsai/cudf/pull/15585, we can now measure multi-thread/multi-stream performance more easily. Here is an analysis I did for some particular benchmark settings on x86-A100 with the default pageable host buffer data source. The benchmark was collected on 1354abdb7, before the changes in https://github.com/rapidsai/cudf/pull/15830. The "speed of light" line assumes SM utilization increases from 78% to 100%, and that all copying after the first thread is fully pipelined with compute on other threads.

[chart: multi-thread Parquet reader throughput with the "speed of light" line]

Each thread reads a 1 GB Parquet file from a host buffer data source, e.g.: `./PARQUET_MULTITHREAD_READER_NVBENCH -d 0 -b 0 --axis num_cols=32 --axis run_length=2 --axis total_data_size=16000000000 --axis num_threads=16`

Multi-stream pipelining improves throughput from ~28 GB/s with 1 thread reading 1 GB to ~37 GB/s with 16 threads reading 16 GB. With reduced copy engine contention, speed-of-light (SOL) analysis of the profile suggests that 16 threads could reach ~47 GB/s.
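As a back-of-the-envelope check (an editorial reading, not part of the original analysis), the assumed SM-utilization headroom alone roughly accounts for the speed-of-light figure:

$$37\ \text{GB/s} \times \frac{100\%}{78\%} \approx 47\ \text{GB/s}$$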

[profile screenshot]