randerzander opened this issue 1 year ago
This is very messy for read_csv, because we do need the data on the host in the current implementation. So this would lead to additional H2D and D2H copies. Other than that, it should be possible to add a parameter to request device-side decompression. Not a trivial change in read_csv, though.
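To make that concrete, here is a sketch only of what such an opt-in could look like. The `decompress_on_device` parameter is hypothetical and does not exist in cudf today; the file name is also made up:

```python
import cudf

# Hypothetical sketch: today cudf inflates gzip input on the host and
# then copies the decompressed bytes to the GPU. A device-side path
# would instead copy the compressed bytes H2D and inflate on the GPU.
df = cudf.read_csv(
    "observations.csv.gz",        # hypothetical file name
    compression="gzip",
    # decompress_on_device=True,  # hypothetical parameter, not in cudf
)
```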
I have a lot of gzip-compressed CSV files. When I use cudf to read them, the host handles decompression before copying the decompressed data to the device. For cudf alone, that's not a problem, since it'll at worst be no slower than a CPU reader. But when I read with dask_cudf, I will usually have at most 8 workers in a LocalCUDACluster, far fewer than a comparable CPU dask.dataframe cluster would have. If I'm reading a large number of compressed files, those 8 workers will be heavily bottlenecked by host-side decompression.
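For reference, a minimal sketch of the setup described above, assuming the usual one-worker-per-GPU layout. The paths are hypothetical; gzip files are not splittable, so each file maps to one partition, which is why blocksize must be None:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

import dask_cudf

# One worker per GPU, so typically <=8 workers on a single node.
cluster = LocalCUDACluster()
client = Client(cluster)

# Hypothetical paths. cudf infers gzip compression from the .gz
# extension and currently inflates on the host, inside each worker
# task, before the decompressed bytes are copied to the GPU.
ddf = dask_cudf.read_csv("daily_weather/*.csv.gz", blocksize=None)
```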
**Describe the solution you'd like**
Ideally, we could have fast device-side decompression for gzip-compressed CSVs.
**Describe alternatives you've considered**
Another solution for dask_cudf could be some logic to make more parallel use of host CPUs for decompression, which should increase throughput to the device; see the sketch below.
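A rough sketch of that alternative, assuming cudf.read_csv accepts an in-memory file-like object (it accepts buffers) and that zlib-based decompression releases the GIL so threads give real parallelism. All file names and pool sizes are hypothetical:

```python
import gzip
import io
from concurrent.futures import ThreadPoolExecutor

import cudf

def load_one(path):
    # Inflate on a host CPU thread; zlib releases the GIL while it
    # decompresses, so a thread pool can keep many cores busy.
    with gzip.open(path, "rb") as f:
        raw = f.read()
    # Hand the already-decompressed bytes to cudf, skipping its own
    # host-side decompression step.
    return cudf.read_csv(io.BytesIO(raw))

paths = [f"daily_weather/obs_{i}.csv.gz" for i in range(64)]  # hypothetical
with ThreadPoolExecutor(max_workers=16) as pool:
    frames = list(pool.map(load_one, paths))
```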
**Additional context**
The per-file compression ratio can be high, such that device-side decompression, even if faster than the CPU, could easily lead to OOM scenarios. For example, a 1 GB gzip file with a 10:1 compression ratio inflates to roughly 10 GB, which on its own exceeds the memory of many GPUs.
An illustrative dataset for use in exploring this problem is NOAA's daily weather observations:
cc @GregoryKimball