Many users and downstream applications rely on CPU/GPU transfers for interoperability, including in the zero-code-change interfaces for pandas, Polars, and Spark. The speed of these transfers can materially impact the benefit of GPU acceleration, depending on the workflow and use case.
Today, when a Python user triggers a transfer of a GPU DataFrame to the CPU via e.g. `to_arrow` or `to_pandas`, every column is sequentially copied to the host with a call to `cudaMemcpyAsync`.
In practice, we end up nowhere near saturating the theoretically available system bandwidth. For example, an A100 system with PCIe Gen4 x16 has a peak unidirectional bandwidth of 32 GB/s, yet we see the following with a 10-column DataFrame of 1 GB int64/float64 columns:
```python
import cudf

NROWS = 125000000
N_INT = 5
N_FLOAT = 5

DTYPES = {f"i{x}": int for x in range(N_INT)}
DTYPES.update({f"f{x}": float for x in range(N_FLOAT)})

df = cudf.datasets.randomdata(nrows=NROWS, dtypes=DTYPES, seed=12)

size_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"{size_mb} MB dataframe")
# 10000.0 MB dataframe

%timeit -n2 -r2 cpu_table = df.to_arrow()
# 1.99 s ± 66.4 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
```
Converting this GPU DataFrame to a CPU PyArrow table therefore moves roughly 10 GB in about 2 seconds, i.e. about 5 GB/s, or roughly 15% of the theoretical peak bandwidth.
These copies already use `cudaMemcpyAsync` under the hood. Would pre-allocating the CPU table's buffers and using multiple streams to fill them asynchronously (with a single synchronization at the end) be a viable path to better saturating the available system resources?
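For illustration, here is a rough standalone sketch of that pattern using Numba CUDA rather than cuDF internals (the buffer names, sizes, and pool size are assumptions made up for this example): pre-allocate pinned host buffers, enqueue one device-to-host copy per column round-robin across a small pool of streams, and synchronize once at the end.

```python
# Sketch only (Numba CUDA, not cuDF internals); names, sizes, and the pool
# size are illustrative assumptions, not an actual cuDF API.
import numpy as np
from numba import cuda

N_STREAMS = 4
N_COLS = 10
NROWS = 1_000_000  # small, just to show the pattern

# Stand-ins for the DataFrame's per-column device buffers.
device_cols = [cuda.to_device(np.random.random(NROWS)) for _ in range(N_COLS)]

# Pre-allocate pinned (page-locked) host buffers so the async copies can overlap.
host_cols = [cuda.pinned_array(NROWS, dtype=np.float64) for _ in range(N_COLS)]

# Issue the per-column copies round-robin across a small stream pool.
streams = [cuda.stream() for _ in range(N_STREAMS)]
for i, (d_col, h_col) in enumerate(zip(device_cols, host_cols)):
    # copy_to_host with an explicit stream enqueues an asynchronous
    # device-to-host copy rather than a blocking one.
    d_col.copy_to_host(h_col, stream=streams[i % N_STREAMS])

# A single synchronization point once all copies are in flight.
for s in streams:
    s.synchronize()
```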
Yes, I definitely think that this kind of optimization is worth exploring. We already have a stream pool in cuIO that we could generalize and repurpose here.
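To make that concrete, a toy Python version of such a pool (purely hypothetical; the class and method names below are invented, and the real pool lives in libcudf's C++ cuIO layer) might look like this, with the per-column copy loop above drawing its streams from the pool instead of creating them ad hoc:

```python
# Hypothetical illustration only; this is not libcudf's actual stream pool API.
from numba import cuda

class StreamPool:
    """Hands out CUDA streams round-robin so independent copies can overlap."""

    def __init__(self, size=4):
        self._streams = [cuda.stream() for _ in range(size)]
        self._next = 0

    def get_stream(self):
        stream = self._streams[self._next % len(self._streams)]
        self._next += 1
        return stream

    def synchronize(self):
        # One join point after all work has been enqueued on the pool.
        for stream in self._streams:
            stream.synchronize()
```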