Many users and downstream applications rely on CPU/GPU transfers for interoperability, including in the zero-code-change interfaces for pandas, Polars, and Spark. The speed of these transfers can materially impact the benefit of GPU acceleration, depending on the workflow and use case.
Today, when a Python user triggers a transfer of a GPU DataFrame to the CPU via e.g. `to_arrow` or `to_pandas`, every column is sequentially copied to the host with a call to `cudaMemcpyAsync`.
In practice, we end up nowhere near saturating the theoretically available system bandwidth. For example, an A100 system with PCIe Gen4 x16 has a peak unidirectional bandwidth of 32 GB/s, yet we see the following with a 10-column DataFrame of 1 GB int64/float64 columns:
```python
import cudf

NROWS = 125000000
N_INT = 5
N_FLOAT = 5

DTYPES = {f"i{x}": int for x in range(N_INT)}
DTYPES.update({f"f{x}": float for x in range(N_FLOAT)})

df = cudf.datasets.randomdata(nrows=NROWS, dtypes=DTYPES, seed=12)

size_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"{size_mb} MB dataframe")
# 10000.0 MB dataframe

%timeit -n2 -r2 cpu_table = df.to_arrow()
# 1.99 s ± 66.4 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
```
Converting this GPU DataFrame to a CPU PyArrow table therefore moves roughly 10 GB in about 2 seconds, i.e. about 5 GB/s, or roughly 15% of the theoretical peak bandwidth.
These copies already use `cudaMemcpyAsync` under the hood. Would pre-allocating the CPU table's buffers and using multiple streams to fill them asynchronously (with a single synchronization at the end) be a viable path to better saturating the available system resources?
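For illustration, here is a rough standalone sketch of that pattern using Numba CUDA rather than cuDF internals (the buffer names, sizes, and pool size are assumptions made up for this example): pre-allocate pinned host buffers, enqueue one device-to-host copy per column round-robin across a small pool of streams, and synchronize once at the end.

```python
# Sketch only (Numba CUDA, not cuDF internals); names, sizes, and the pool
# size are illustrative assumptions, not an actual cuDF API.
import numpy as np
from numba import cuda

N_STREAMS = 4
N_COLS = 10
NROWS = 1_000_000  # small, just to show the pattern

# Stand-ins for the DataFrame's per-column device buffers.
device_cols = [cuda.to_device(np.random.random(NROWS)) for _ in range(N_COLS)]

# Pre-allocate pinned (page-locked) host buffers so the async copies can overlap.
host_cols = [cuda.pinned_array(NROWS, dtype=np.float64) for _ in range(N_COLS)]

# Issue the per-column copies round-robin across a small stream pool.
streams = [cuda.stream() for _ in range(N_STREAMS)]
for i, (d_col, h_col) in enumerate(zip(device_cols, host_cols)):
    # copy_to_host with an explicit stream enqueues an asynchronous
    # device-to-host copy rather than a blocking one.
    d_col.copy_to_host(h_col, stream=streams[i % N_STREAMS])

# A single synchronization point once all copies are in flight.
for s in streams:
    s.synchronize()
```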
Yes, I definitely think that this kind of optimization is worth exploring. We already have a stream pool in cuIO that we could generalize and repurpose here.
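To make that concrete, a toy Python version of such a pool (purely hypothetical; the class and method names below are invented, and the real pool lives in libcudf's C++ cuIO layer) might look like this, with the per-column copy loop above drawing its streams from the pool instead of creating them ad hoc:

```python
# Hypothetical illustration only; this is not libcudf's actual stream pool API.
from numba import cuda

class StreamPool:
    """Hands out CUDA streams round-robin so independent copies can overlap."""

    def __init__(self, size=4):
        self._streams = [cuda.stream() for _ in range(size)]
        self._next = 0

    def get_stream(self):
        stream = self._streams[self._next % len(self._streams)]
        self._next += 1
        return stream

    def synchronize(self):
        # One join point after all work has been enqueued on the pool.
        for stream in self._streams:
            stream.synchronize()
```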