pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`polars.Expr.take` returns `null` if `ComputeError: index out of bounds` #8171

Open stevenlis opened 1 year ago

stevenlis commented 1 year ago

Problem description

Sometimes you have groups with different numbers of rows. It would be nice if polars.Expr.take could just return null when the index is out of bounds, instead of raising.

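import polars as pl

df = pl.DataFrame(
    {
        "group": [
            "one",
            "one",
            "one",
            "two",
            "two",
        ],
        "value": [1, 98, 2, 3, 99],
    }
)
# Index 2 exists in group "one" but not in group "two" (only two rows),
# so this currently raises ComputeError: index out of bounds.
df.groupby("group", maintain_order=True).agg(pl.col("value").take(2))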

Desired output:

shape: (2, 2)
┌───────┬───────┐
│ group ┆ value │
│ ---   ┆ ---   │
│ str   ┆ i64   │
╞═══════╪═══════╡
│ one   ┆ 2     │
│ two   ┆ null  │
└───────┴───────┘

For now, I use this workaround:

df.groupby("group", maintain_order=True).agg(pl.col("value").shift(-2).first())
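
To see why this workaround gives null for the short group, here is a minimal plain-Python model of it (no polars needed). `shift_up` and `first` are hypothetical helpers written for illustration, not polars functions; they only mimic what I understand `shift(-2)` and `first()` to do within each group.

```python
# Plain-Python model of the shift(-2).first() workaround.
# shift_up and first are hypothetical helpers, not part of polars.

def shift_up(values, n):
    """Model of shift(-n): drop the first n values, pad the end with None."""
    n = min(n, len(values))
    return values[n:] + [None] * n


def first(values):
    """Model of first(): the first value, or None for an empty group."""
    return values[0] if values else None


groups = {"one": [1, 98, 2], "two": [3, 99]}
result = {name: first(shift_up(vals, 2)) for name, vals in groups.items()}
print(result)  # {'one': 2, 'two': None}
```

Group "two" has only two rows, so after shifting by two everything left is padding and the first value is None. Note that a real null stored at index 2 would look the same as an out-of-bounds index.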

cmdlineluser commented 1 year ago

.arr.get returns null instead of raising, which may be another option:

>>> df.groupby("group", maintain_order=True).agg(pl.col("value").list().arr.get(2))
shape: (2, 2)
┌───────┬───────┐
│ group ┆ value │
│ ---   ┆ ---   │
│ str   ┆ i64   │
╞═══════╪═══════╡
│ one   ┆ 2     │
│ two   ┆ null  │
└───────┴───────┘

uditrana commented 10 months ago

I would love for take to have this as a flag though... I think it's a pretty reasonable and common use case! Syntactically, take is the more canonical way to operate on columns, rather than treating the column as an array, imo.

It also seems that list.take has this flag but take does not? https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.list.take.html
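
As a rough sketch of what an opt-in flag on take could mean, here is a plain-Python model. `take_one` is a hypothetical helper, and the flag name `null_on_oob` is an assumption on my part, not a confirmed polars parameter for `Expr.take`. The default keeps today's strict behaviour, and the flag switches to returning null.

```python
# Plain-Python sketch of a take with an opt-in null-on-out-of-bounds flag.
# take_one is a hypothetical helper; the flag name null_on_oob is assumed.

def take_one(values, index, null_on_oob=False):
    """Return values[index]; out of bounds raises unless null_on_oob is set."""
    if -len(values) <= index < len(values):
        return values[index]
    if null_on_oob:
        return None
    raise IndexError(f"index {index} out of bounds for length {len(values)}")


groups = {"one": [1, 98, 2], "two": [3, 99]}
lenient = {
    name: take_one(vals, 2, null_on_oob=True) for name, vals in groups.items()
}
print(lenient)  # {'one': 2, 'two': None}
```

Keeping the flag off by default means existing code that relies on the error keeps failing loudly, and only callers who ask for it get nulls.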