rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Support Polars expression calculating lengths / count from a parquet file #16180

Open beckernick opened 5 months ago

beckernick commented 5 months ago

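We can't currently run a read + length/count operation on a parquet file with the cudf-polars callback. In the example below, `pl.len()` works on an in-memory LazyFrame (with and without the GPU callback), but fails when the source is `pl.scan_parquet`: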

import polars as pl
from functools import partial
from cudf_polars.callback import execute_with_cudf
import numpy as np

use_cudf = partial(execute_with_cudf, raise_on_fail=True)

ldf = pl.DataFrame({
    "date": ['2015-09-11', '2017-02-08', '2015-08-01', '2019-03-16', '2015-05-15'],
    "val": [1, 2, 3, 4, 5]
}).lazy()

ldf.sink_parquet("test.parquet")

print(ldf.select(pl.len()).collect())
print(ldf.select(pl.len()).collect(post_opt_callback=use_cudf))
print(pl.scan_parquet("test.parquet").select(pl.len()).collect(post_opt_callback=use_cudf))
shape: (1, 1)
┌─────┐
│ len │
│ --- │
│ u32 │
╞═════╡
│ 5   │
└─────┘
shape: (1, 1)
┌─────┐
│ len │
│ --- │
│ u32 │
╞═════╡
│ 5   │
└─────┘
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[55], line 17
     15 print(ldf.select(pl.len()).collect())
     16 print(ldf.select(pl.len()).collect(post_opt_callback=use_cudf))
---> 17 print(pl.scan_parquet("test.parquet").select(pl.len()).collect(post_opt_callback=use_cudf))

File /raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

ComputeError: 'cuda' conversion failed: NotImplementedError: function count

Perhaps this happens because we haven't implemented the FAST COUNT operation, which may be a specialized node that uses parquet file metadata to answer things like row counts? The query plans for the in-memory and parquet cases differ:

print(ldf.select(pl.len()).explain())
print(pl.scan_parquet("test.parquet").select(pl.len()).explain())
 SELECT [len()] FROM
  DF ["date", "val"]; PROJECT 1/2 COLUMNS; SELECTION: None
FAST COUNT(*)
  DF []; PROJECT */0 COLUMNS; SELECTION: None

lithomas1 commented 4 months ago

Looks like this needs to be exposed on the Polars side as well as on our side (so it won't work out of the box with Polars 1.0).

https://github.com/pola-rs/polars/blob/09c98c58d7806d587deb31e53c47e94565bbafc8/py-polars/src/lazyframe/visitor/nodes.rs#L572-L576