rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Support device-side de/compression of CSV files #12255

Open · randerzander opened this issue 1 year ago

randerzander commented 1 year ago

I have a lot of gzip-compressed CSV files. When I use cudf to read them, the host handles decompression before copying the decompressed data to the device.

For cudf alone, that's not a problem, since at worst it'll be no slower than the CPU.

But when I read with dask_cudf, I will usually have <=8 workers in a LocalCUDACluster, whereas CPU dask.dataframe can use many more worker processes. If I'm reading a large number of compressed files, those 8 workers will be heavily bottlenecked by host-side decompression.

**Describe the solution you'd like**
Ideally, we could have fast device-side decompression for gzip-compressed CSVs.

**Describe alternatives you've considered**
Another solution for dask_cudf could be logic that makes more parallel use of host CPUs for decompression, which should increase throughput to the device; a rough sketch follows below.
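
A minimal sketch of that host-side alternative, assuming plain Python threads rather than anything inside dask_cudf (the `read_gzip_csv` helper is illustrative, not an existing API):

```python
# Illustrative only: decompress gzip files on host threads in parallel,
# then hand the raw bytes to cudf.read_csv for GPU parsing.
import glob
import gzip
import io
from concurrent.futures import ThreadPoolExecutor

import cudf

def read_gzip_csv(path):
    # zlib releases the GIL while inflating, so host threads overlap here
    with gzip.open(path, 'rb') as f:
        raw = f.read()
    # parse the already-decompressed bytes on the device
    return cudf.read_csv(io.BytesIO(raw), compression=None)

paths = sorted(glob.glob('/raid/weather/csv/*.csv.gz'))
with ThreadPoolExecutor(max_workers=16) as pool:
    frames = list(pool.map(read_gzip_csv, paths))
df = cudf.concat(frames)
```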

**Additional context**
Per-file compression ratios can be high, such that device-side decompression, even if faster than the CPU, could easily lead to OOM scenarios (e.g., a 2 GB .csv.gz at a 10:1 ratio inflates to ~20 GB of device memory).

An illustrative dataset for use in exploring this problem is NOAA's daily weather observations:

```python
import os
import urllib.request  # bare `import urllib` doesn't expose urlretrieve

data_dir = '/raid/weather/csv/'
os.makedirs(data_dir, exist_ok=True)

# download one gzip-compressed CSV of daily weather observations per year
base_url = 'ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/'
for year in range(1763, 2020):
    fn = str(year) + '.csv.gz'
    if not os.path.isfile(data_dir + fn):
        print(f'Downloading {base_url + fn} to {data_dir + fn}')
        urllib.request.urlretrieve(base_url + fn, data_dir + fn)
```
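
Reading the result with dask_cudf, something like the following should reproduce the bottleneck (a sketch, assuming the files above are in place; `blocksize=None` yields one partition per file, since gzip files can't be split):

```python
import dask_cudf

# each worker decompresses whole .csv.gz files on its host CPU before
# the GPU parse, which is the bottleneck described above
ddf = dask_cudf.read_csv(data_dir + '*.csv.gz', blocksize=None)
print(ddf.head())
```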

cc @GregoryKimball

vuule commented 1 year ago

This is very messy for read_csv, because the current implementation needs the data on the host, so decompressing on the device would lead to additional H2D and D2H copies. Other than that, it should be possible to add a parameter to request device-side decompression. Not a trivial change in read_csv, though.
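
For illustration, such an opt-in might look like this from Python (the `device_decompression` keyword is hypothetical; cudf.read_csv exposes no such parameter today):

```python
import cudf

# Hypothetical opt-in keyword, purely illustrative: decompression would
# run on the device (e.g., via a GPU library such as nvCOMP) rather than
# on the host before the CSV parse.
df = cudf.read_csv('/raid/weather/csv/2019.csv.gz',
                   device_decompression=True)
```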