Open beckernick opened 5 months ago
We can't currently do a read + length/count operation on a parquet file.
import polars as pl from functools import partial from cudf_polars.callback import execute_with_cudf import numpy as np use_cudf = partial(execute_with_cudf, raise_on_fail=True) ldf = pl.DataFrame({ "date": ['2015-09-11', '2017-02-08', '2015-08-01', '2019-03-16', '2015-05-15'], "val": [1, 2, 3, 4, 5] }).lazy() ldf.sink_parquet("test.parquet") print(ldf.select(pl.len()).collect()) print(ldf.select(pl.len()).collect(post_opt_callback=use_cudf)) print(pl.scan_parquet("test.parquet").select(pl.len()).collect(post_opt_callback=use_cudf)) shape: (1, 1) ┌─────┐ │ len │ │ --- │ │ u32 │ ╞═════╡ │ 5 │ └─────┘ shape: (1, 1) ┌─────┐ │ len │ │ --- │ │ u32 │ ╞═════╡ │ 5 │ └─────┘ --------------------------------------------------------------------------- ComputeError Traceback (most recent call last) Cell In[55], line 17 15 print(ldf.select(pl.len()).collect()) 16 print(ldf.select(pl.len()).collect(post_opt_callback=use_cudf)) ---> 17 print(pl.scan_parquet("test.parquet").select(pl.len()).collect(post_opt_callback=use_cudf)) File [/raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942](http://10.117.23.184:8882/lab/tree/raid/nicholasb/raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py#line=1941), in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs) 1939 # Only for testing purposes atm. 1940 callback = _kwargs.get("post_opt_callback") -> 1942 return wrap_df(ldf.collect(callback)) ComputeError: 'cuda' conversion failed: NotImplementedError: function count
Perhaps this happens because we don't have the FAST COUNT op implemented, which may be a specialized op to take advantage of parquet file metadata for things like row counts?
FAST COUNT
print(ldf.select(pl.len()).explain()) print(pl.scan_parquet("test.parquet").select(pl.len()).explain()) SELECT [len()] FROM DF ["date", "val"]; PROJECT 1/2 COLUMNS; SELECTION: None FAST COUNT(*) DF []; PROJECT */0 COLUMNS; SELECTION: None
Looks like needs to be exposed on polars side as well as our side (so this won't work out of the box with polars 1.0).
https://github.com/pola-rs/polars/blob/09c98c58d7806d587deb31e53c47e94565bbafc8/py-polars/src/lazyframe/visitor/nodes.rs#L572-L576
We can't currently do a read + length/count operation on a parquet file.
Perhaps this happens because we don't have the
FAST COUNT
op implemented, which may be a specialized op to take advantage of parquet file metadata for things like row counts?