weiji14 opened this issue 4 years ago
Hi there,

Just wondering if there's scope for a `to_cudf` type functionality, so that users can read Parquet files directly into GPU memory (bypassing the CPU). This would use the `cudf.read_parquet` function.

Happy to submit a Pull Request for this, but I'd like to have a discussion around the implementation first: should it be handled as a `to_cudf` method, or via something like `engine="cudf"` (though `cudf` also has a "pyarrow" engine, like pandas)?

One issue though is that `cudf` cannot read multi-file Parquet folders yet (see https://github.com/rapidsai/cudf/issues/1688), only single binary Parquet files. This might get implemented in a future `cudf` release (v0.16?) though.
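For reference, here is a minimal sketch of the direct GPU read being proposed (the file path is a placeholder):

```python
import cudf

# Read a single Parquet file straight into GPU memory,
# skipping the pandas/CPU round trip entirely.
gdf = cudf.read_parquet("data.parquet")  # a cudf DataFrame held on the GPU

# The result can still be pulled back to host memory when needed:
df = gdf.to_pandas()
```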
---

I could see it either way, as an argument to `to_pandas` (and/or `to_dask`), or as its own method. How many of the sources do you think it would apply to? I know cuDF has performant Parquet and CSV readers.

---

> I could see it either way, as an argument to `to_pandas` (and/or `to_dask`), or as its own method.

True, since it's possible to have dataframes loaded onto a single GPU (à la `to_pandas`) or onto multiple GPUs (`to_dask`). So we could have either:

1. `to_pandas(engine="cudf")` and `to_dask(engine="cudf")`, or
2. `to_cudf()` (which uses `cudf.read_parquet`) and `to_dask_cudf()` (which uses `dask_cudf.read_parquet`); a rough sketch of this option is below.

One problem with Option 1 is that the `cudf.read_parquet` reader has an `engine="pyarrow"` option too, and that would conflict with `pandas.read_parquet`. We could work around that (using something like `.to_pandas(engine="pyarrow", backend="gpu")`), but that might get ugly.
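For concreteness, here's roughly what Option 2 could look like on the Parquet source class (the class below is a simplified stand-in, not the actual intake internals):

```python
import cudf
import dask_cudf

class ParquetSource:
    """Simplified stand-in for intake_parquet's source class."""

    def __init__(self, urlpath, **parquet_kwargs):
        self._urlpath = urlpath
        self._kwargs = parquet_kwargs

    def to_cudf(self):
        # Load into a single GPU as a cudf.DataFrame,
        # analogous to to_pandas() on the CPU side.
        return cudf.read_parquet(self._urlpath, **self._kwargs)

    def to_dask_cudf(self):
        # Load across one or more GPUs as a dask_cudf DataFrame,
        # analogous to to_dask().
        return dask_cudf.read_parquet(self._urlpath, **self._kwargs)
```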
> How many of the sources do you think it would apply to? I know cuDF has performant Parquet and CSV readers.
Looking at cudf's IO readers at https://docs.rapids.ai/api/cudf/stable/api.html#module-cudf.io.csv, several file formats are currently available (CSV and Parquet among them).
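On the multi-file limitation mentioned earlier: `dask_cudf` already partitions a read across files, so `to_dask_cudf()` would sidestep it even before plain `cudf` gains folder support. A quick sketch (the path and column names are placeholders):

```python
import dask_cudf

# A multi-file Parquet dataset that plain cudf can't yet open
# can still be read by partitioning it across GPU workers:
ddf = dask_cudf.read_parquet("dataset_dir/*.parquet")

# Operations run on the GPU partitions and are gathered on .compute():
means = ddf.groupby("key").value.mean().compute()
```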
Perhaps we should discuss this upstream at https://github.com/intake/intake too :grin: