[BUG] Polars combined filter and head throws error with some datasets (row limit in scan error)

beckernick commented 3 months ago

The following filter + head operation works smoothly:

import polars as pl
from functools import partial
from cudf_polars.callback import execute_with_cudf
import numpy as np

ldf = pl.DataFrame({"a": ["a", "b", "c", "c", "b"]}).lazy()
ldf.select(pl.col("a") == "b").head().collect(
    post_opt_callback=partial(execute_with_cudf, raise_on_fail=True)
)  # works as expected

It also works on a larger dataset:

import polars as pl
from functools import partial
from cudf_polars.callback import execute_with_cudf
import numpy as np

N = 100000000
K = 20

ldf = pl.DataFrame({
    "a": np.random.choice(K, N),
    "b": np.random.choice(K, N),
    "c": np.random.choice(K, N),
}).lazy()

ldf.select(pl.col("b") == 11).head().collect(post_opt_callback=partial(execute_with_cudf, raise_on_fail=True))

But when I use a different dataset, loaded with pl.scan_csv, I get an error (apologies for the inaccessible dataset path):

transactions = pl.scan_csv("/raid/manass/cudf/data/half_transactions.csv")

transactions.select(pl.col("DAY") == 11).head().collect(
    post_opt_callback=partial(execute_with_cudf, raise_on_fail=True)
)
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[22], line 3
      1 transactions = pl.scan_csv("/raid/manass/cudf/data/half_transactions.csv")
----> 3 transactions.select(pl.col("DAY") == 11).head().collect(
      4     post_opt_callback=partial(execute_with_cudf, raise_on_fail=True)
      5 )

File /raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

ComputeError: 'cuda' conversion failed: NotImplementedError: row limit in scan

But the filter on its own works:

transactions = pl.scan_csv("/raid/manass/cudf/data/half_transactions.csv")

transactions.select(pl.col("DAY") == 11).collect(
    post_opt_callback=partial(execute_with_cudf, raise_on_fail=True)
)

lithomas1 commented 3 months ago

Looks like this comes from here: https://github.com/rapidsai/cudf/blob/1a4c2aa38c6e7de8c6937b787a1263a4ccddadea/python/cudf_polars/cudf_polars/dsl/ir.py#L204-L205

I'll take a look at this as part of my I/O work (this should be enable-able for csv, and for parquet, we need to do a little bit of work to expose the option from cython/pylibcudf)

wence- commented 3 months ago

> I'll take a look at this as part of my I/O work (this should be enable-able for csv, and for parquet, we need to do a little bit of work to expose the option from cython/pylibcudf)

@lithomas1 If you think this will take more than a day or two, can you open a PR that moves this NotImplementedError into the __post_init__ method of Scan?
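
For readers unfamiliar with the pattern, here is a minimal sketch of raising the error when the node is constructed rather than when it is evaluated. This is not the cudf_polars code: the Scan class below is a hypothetical stand-in, and its fields (path, n_rows) and the can_translate helper are assumptions made purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Scan:
    """Hypothetical stand-in for a scan IR node (not the real cudf_polars class)."""

    path: str
    n_rows: int | None = None  # assumed field: row limit pushed into the scan

    def __post_init__(self) -> None:
        # Reject the unsupported option as soon as the node is built,
        # before any evaluation work starts.
        if self.n_rows is not None:
            raise NotImplementedError("row limit in scan")

    def evaluate(self) -> str:
        # Placeholder for the real read; a node that reaches this point
        # has already passed the check in __post_init__.
        return f"read all rows of {self.path}"


def can_translate(path: str, n_rows: int | None) -> bool:
    """Hypothetical helper: report whether the node can be constructed."""
    try:
        Scan(path, n_rows)
    except NotImplementedError:
        return False
    return True
```

With this layout, can_translate("data.csv", 5) returns False without ever calling evaluate, so a caller can decide what to do with an unsupported plan before any execution starts.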

lithomas1 commented 3 months ago

> > I'll take a look at this as part of my I/O work (this should be enable-able for csv, and for parquet, we need to do a little bit of work to expose the option from cython/pylibcudf)
>
> @lithomas1 If you think this will take more than a day or two, can you open a PR that moves this NotImplementedError into the __post_init__ method of Scan?

I should be able to get this in the next couple of days (it should be pretty easy to expose the bindings, but the full migration to pylibcudf is pretty far down the stack).

rapidsai / cudf

[BUG] Polars combined filter and head throws error with some datasets (row limit in scan error) #16172