rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Support device-side de/compression of CSV files #12255

Open · randerzander opened this issue 1 year ago

randerzander commented 1 year ago

I have a lot of gzip-compressed CSV files. When I use cudf to read them, the host handles decompression before copying the decompressed data to the device.

For cudf alone, that's not a problem, since at worst it'll be no slower than the CPU.

But when I read with dask_cudf, I will usually have <=8 workers in a LocalCUDACluster, whereas CPU dask.dataframe can use many more worker processes. If I'm reading a large number of compressed files, those 8 workers will be heavily bottlenecked by host-side decompression.

**Describe the solution you'd like**
Ideally, we could have fast device-side decompression for gzip-compressed CSVs.

**Describe alternatives you've considered**
Another solution for dask_cudf could be logic that makes more parallel use of host CPUs for decompression, which should increase throughput to the device; a rough sketch follows below.
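
A minimal sketch of that host-side alternative, assuming plain Python threads rather than anything inside dask_cudf (the `read_gzip_csv` helper is illustrative, not an existing API):

```python
# Illustrative only: decompress gzip files on host threads in parallel,
# then hand the raw bytes to cudf.read_csv for GPU parsing.
import glob
import gzip
import io
from concurrent.futures import ThreadPoolExecutor

import cudf

def read_gzip_csv(path):
    # zlib releases the GIL while inflating, so host threads overlap here
    with gzip.open(path, 'rb') as f:
        raw = f.read()
    # parse the already-decompressed bytes on the device
    return cudf.read_csv(io.BytesIO(raw), compression=None)

paths = sorted(glob.glob('/raid/weather/csv/*.csv.gz'))
with ThreadPoolExecutor(max_workers=16) as pool:
    frames = list(pool.map(read_gzip_csv, paths))
df = cudf.concat(frames)
```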

**Additional context**
Per-file compression ratios can be high, such that device-side decompression, even if faster than the CPU, could easily lead to OOM scenarios (e.g., a 2 GB .csv.gz at a 10:1 ratio inflates to ~20 GB of device memory).

An illustrative dataset for use in exploring this problem is NOAA's daily weather observations:

```python
import os
import urllib.request  # bare `import urllib` doesn't expose urlretrieve

data_dir = '/raid/weather/csv/'
os.makedirs(data_dir, exist_ok=True)

# download one gzip-compressed CSV of daily weather observations per year
base_url = 'ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/'
for year in range(1763, 2020):
    fn = str(year) + '.csv.gz'
    if not os.path.isfile(data_dir + fn):
        print(f'Downloading {base_url + fn} to {data_dir + fn}')
        urllib.request.urlretrieve(base_url + fn, data_dir + fn)
```
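
Reading the result with dask_cudf, something like the following should reproduce the bottleneck (a sketch, assuming the files above are in place; `blocksize=None` yields one partition per file, since gzip files can't be split):

```python
import dask_cudf

# each worker decompresses whole .csv.gz files on its host CPU before
# the GPU parse, which is the bottleneck described above
ddf = dask_cudf.read_csv(data_dir + '*.csv.gz', blocksize=None)
print(ddf.head())
```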

cc @GregoryKimball

vuule commented 1 year ago

This is very messy for read_csv, because the current implementation needs the data on the host, so decompressing on the device would lead to additional H2D and D2H copies. Other than that, it should be possible to add a parameter to request device-side decompression. Not a trivial change in read_csv, though.
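
For illustration, such an opt-in might look like this from Python (the `device_decompression` keyword is hypothetical; cudf.read_csv exposes no such parameter today):

```python
import cudf

# Hypothetical opt-in keyword, purely illustrative: decompression would
# run on the device (e.g., via a GPU library such as nvCOMP) rather than
# on the host before the CSV parse.
df = cudf.read_csv('/raid/weather/csv/2019.csv.gz',
                   device_decompression=True)
```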