pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.52k stars 1.88k forks source link

expose `null_on_oob` on `gather`. #16842

Open ritchie46 opened 3 months ago

ritchie46 commented 3 months ago

Description

Similar to other get and list.gather, but then on Expr/Series.

Ref https://github.com/pola-rs/polars/issues/15240

cmdlineluser commented 1 month ago

Will this issue also cover Expr.get - or does it need to be opened separately?

There have been a couple of questions in the past few days ^1 with people trying to use .list.get() as a workaround and running into confusion.

ddouglas87 commented 1 month ago

Request: Have expr.get() default to null_on_oob=True. Most real world use cases would want null_on_oob=True. Always having to type that in becomes tedious. I know this breaks backwards compatibility but in this case it probably will not cause issues in preexisting code bases.

deanm0000 commented 1 month ago

It doesn't just break backwards compatibility it goes against the fail early, fail loudly principle. It would also be inconsistent with cast strict=True default.

ddouglas87 commented 1 month ago

Using group_by_dynamic at the end of the rolling aggregate there is always a list size of 1 so .get() will always fail, so one would have to type in null_on_oob=True for all use cases to get it to work. It will not not break preexisting code bases because you have to do null_on_oob=True or it doesn't work at all.

Right now because null_on_oob=True doesn't exist the only way to get an nth element in a rolling aggregate is to do .shift(nth).last(). Shift allows nulls by default.

An alternative solution is to allow rolling aggregates to stop ahead of time instead of unwinding at the end. When I use group_by_dynamic I always chop off the ending. imo the root cause here isn't a lack of null_on_oob=True but that rolling aggregates have no option of stopping early.

dalejung commented 1 month ago

I think the oob-ing on non-null indexes makes sense. The issue is with contexts like aggregations where you can get nulls.

Example: https://github.com/pola-rs/polars/issues/15163

I don't think raising OOB there makes sense.

But that might be discussion for another ticket. Prio should be to at least get an opt out with Expr.get(null_on_oob=False).