pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Improved error message for multiple expressions #7395

Open braaannigan opened 1 year ago

braaannigan commented 1 year ago

Problem description

The following now generates an error message that I think could be improved:

import polars as pl
df = pl.DataFrame({'lag1':[0,1,None],'lag2':[0,None,2]})
df.filter(pl.col('^lag.*$').is_not_null())

ComputeError: The predicate passed to 'LazyFrame.filter' expanded to multiple expressions:

    col("lag1").is_not_null().any(),
    col("lag2").is_not_null().any(),
This is ambiguous. Try to combine the predicates with the 'all' or `any' expression.

As @ghuls pointed out on Discord, this can be resolved with:

df.filter(pl.all(pl.col('^lag.*$').is_not_null()))

I think the error message needs to be more explicit, as it wasn't obvious to me how to solve this. It may be easiest to add a link to an example in the Python API docs, in which case I could take it on.

ritchie46 commented 1 year ago

What is your proposal then? I tried to make it clear by showing which expressions the predicate expands to and by giving a hint to combine the predicates with all or any.

If we want to show the original expression explicitly, we have to clone it first. That would add some latency to the filter call, and I am not sure it is worth it, given that some of our users need very low-latency answers.

braaannigan commented 1 year ago

I'm proposing that we:

ritchie46 commented 1 year ago

I don't think we should link to our API docs in code. That can age pretty badly. If we do something like that, we should make an error registry, but currently that is too much work IMO.

I do think we can add a docstring example. :+1:

ghuls commented 1 year ago

Docstring examples in filter make the most sense to me.

SydneyUni-Jim commented 2 weeks ago

You now need to use pl.all_horizontal or pl.any_horizontal.

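import polars as pl
import polars.selectors as cs

# DataFrame from the original report
df = pl.DataFrame({'lag1': [0, 1, None], 'lag2': [0, None, 2]})

# Keep only rows where every column starting with 'lag' is non-null
df.filter(
    pl.all_horizontal(
        cs.starts_with('lag').is_not_null()
    )
)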