intake / intake-parquet

Parquet plugin for Intake
https://intake-parquet.readthedocs.io/en/latest/?badge=latest
BSD 2-Clause "Simplified" License
12 stars 14 forks source link

Add a to_cudf method for reading directly into GPU memory #17

Open weiji14 opened 4 years ago

weiji14 commented 4 years ago

Hi there,

Just wondering if there's scope for a to_cudf type functionality so that users can read Parquet files directly into GPU memory (bypassing the CPU). This would be using the cudf.read_parquet function.

Happy to submit a Pull Request for this, but would like to have a discussion around the implementation, whether it should be handled as a to_cudf method, or via something like engine="cudf" (though cudf also has a "pyarrow" engine like pandas).

One issue though is that cudf cannot read multi-file Parquet folders yet (see https://github.com/rapidsai/cudf/issues/1688), only single binary parquet files. This might get implemented in the future (v0.16?) cudf release though.

martindurant commented 4 years ago

I could see it either way, as an argument to to_pandas (and/or to_dask), or as its own method. How many of the sources do you think it would apply to? I know cuDF have performant parquet and CSV readers.

weiji14 commented 4 years ago

I could see it either way, as an argument to to_pandas (and/or to_dask), or as its own method.

True, since it's possible to have dataframes loaded into a single GPU (ala to_pandas), or multiple GPUs (to_dask). That sounds like. So we could either have:

  1. Something like to_pandas(engine="cudf") and to_dask(engine="cudf")
  2. Something like to_cudf() (which uses cudf.read_parquet) or to_dask_cudf (which uses dask_cudf.read_parquet).

One problem with Option 1 is that the cudf_read_parquet reader has engine="pyarrow" too, and that would conflict with pandas.read_parquet. We could workaround that (using .to_pandas(engine="pyarrow", backend="gpu") but that might get ugly.

How many of the sources do you think it would apply to? I know cuDF have performant parquet and CSV readers.

Looking at cudf's IO readers at https://docs.rapids.ai/api/cudf/stable/api.html#module-cudf.io.csv, these file formats are currently available:

Perhaps we should discuss this upstream at https://github.com/intake/intake too :grin: