pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
28.66k stars 1.79k forks source link

group_by_dynamic with offset raises OutOfBoundsError when using `pl.Expr.get` #15091

Closed dalejung closed 1 week ago

dalejung commented 5 months ago

Checks

Reproducible example

dates = pl.date_range(
    pl.datetime(2023, 1, 1),
    pl.datetime(2023, 12, 31),
    eager=True,
)
df = pl.DataFrame({
    'date': dates,
}).with_columns(
    num=pl.int_range(pl.len())
)

assert isinstance(df, pl.DataFrame)

grouped = df.group_by_dynamic(
    'date',
    every='1mo',
    offset="20d",
    include_boundaries=True,
)

Note that each groups has more than 2 rows.


report = grouped.agg(
    pl.len(),
)

# print(report)
#     _lower_boundary      _upper_boundary                 date  len
# 0   2023-01-21 00:00:00  2023-02-21 00:00:00  2023-01-21 00:00:00   31
# 1   2023-02-21 00:00:00  2023-03-21 00:00:00  2023-02-21 00:00:00   28
# 2   2023-03-21 00:00:00  2023-04-21 00:00:00  2023-03-21 00:00:00   31
# 3   2023-04-21 00:00:00  2023-05-21 00:00:00  2023-04-21 00:00:00   30
# 4   2023-05-21 00:00:00  2023-06-21 00:00:00  2023-05-21 00:00:00   31
# 5   2023-06-21 00:00:00  2023-07-21 00:00:00  2023-06-21 00:00:00   30
# 6   2023-07-21 00:00:00  2023-08-21 00:00:00  2023-07-21 00:00:00   31
# 7   2023-08-21 00:00:00  2023-09-21 00:00:00  2023-08-21 00:00:00   31
# 8   2023-09-21 00:00:00  2023-10-21 00:00:00  2023-09-21 00:00:00   30
# 9   2023-10-21 00:00:00  2023-11-21 00:00:00  2023-10-21 00:00:00   31
# 10  2023-11-21 00:00:00  2023-12-21 00:00:00  2023-11-21 00:00:00   30
# 11  2023-12-21 00:00:00  2024-01-21 00:00:00  2023-12-21 00:00:00   11

Both aggregations below causes: OutOfBoundsError: gather indices are out of bounds


bad_report = grouped.agg(
    pl.col('num').get(
        pl.col('num').arg_min(),
    ).alias('num_get'),
)

# hard coded 1 also raises error though though each group has more than 2 members.
bad_hard_coded_report = grouped.agg(
    pl.col('num').get(
        1,
    ).alias('num_get'),
)

Log output

No response

Issue description

Getting an OutOfBoundsError when using group_by_dynamic with offset. As show in the code example, each of the subgroups has at least 2 members and yet pl.Expr.get(1) returns an error.

Expected behavior

I would expect pl.col(col).get(pl.col(col).arg_min()) to not error. By definition arg_min should provide a valid index to get.

Installed versions

``` --------Version info--------- Polars: 0.20.16-rc.1 Index type: UInt32 Platform: Linux-6.7.8-arch1-1-x86_64-with-glibc2.39 Python: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] ----Optional dependencies---- adbc_driver_manager: cloudpickle: 3.0.0 connectorx: deltalake: fastexcel: fsspec: 2024.2.0 gevent: hvplot: 0.9.2.post8+g4cb29ba matplotlib: 3.8.3 numpy: 1.26.4 openpyxl: 3.1.2 pandas: 3.0.0.dev0+432.g5bcc7b7077 pyarrow: 16.0.0.dev248+g120a1dbec pydantic: 1.10.13 pyiceberg: pyxlsb: sqlalchemy: xlsx2csv: xlsxwriter: ```
mcrumiller commented 5 months ago

Just to help anyone looking into this, it looks like the issue happens once the offset passes the halfway point through the every parameter.

dalejung commented 5 months ago
    grouped = df.group_by_dynamic(
        'date',
        every='1mo',
        offset='10d',
    )

    report = grouped.agg(
        pl.col('num').first().alias('num_first'),
        pl.col('num').min().alias('num_min'),
        pl.col('num').get(
            pl.col('num').arg_min(),
        ).alias('num_get'),
    )
                   date  num_first  num_min  num_get
0   2023-01-11 00:00:00         10       10       20
1   2023-02-11 00:00:00         41       41       51
2   2023-03-11 00:00:00         69       69       79
3   2023-04-11 00:00:00        100      100      110
4   2023-05-11 00:00:00        130      130      140
5   2023-06-11 00:00:00        161      161      171
6   2023-07-11 00:00:00        191      191      201
7   2023-08-11 00:00:00        222      222      232
8   2023-09-11 00:00:00        253      253      263
9   2023-10-11 00:00:00        283      283      293
10  2023-11-11 00:00:00        314      314      324
11  2023-12-11 00:00:00        344      344      354

This is similar to https://github.com/pola-rs/polars/issues/15078 actually. When offset does not error, it returns incorrect results.

num_get column should be the same as num_first and num_min

deanm0000 commented 4 months ago

I'm not sure what's happening but a workaround is to just let it resolve to a List from the agg and then use the list namespace in a separate with_columns

So, instead of:

grouped.agg(
    pl.col('num').get(
        pl.col('num').arg_min(),
    ).alias('num_get'),
)

do

grouped.agg(
    pl.col('num').alias('num_get')
).with_columns(
    pl.col('num_get').list.get(pl.col('num_get').list.arg_min())
)

I guessed this might be unrelated to group_by_dynamic so tested that with:

 (df
.with_columns(z=pl.col('date').dt.offset_by("-20d"))
.group_by(pl.col('z').dt.truncate('1mo'))
.agg(
    pl.col('date').first().alias('_lower_bound'),
    pl.col('date').last().alias('_upper_bound')+pl.duration(days=1),
    pl.col('date').first(),
    pl.col('num').get(pl.col('num').arg_min()).name.suffix("_get")
    )
.drop('z')
)

but that does work so apparently it is specific to group_by_dynamic and not agg

itamarst commented 1 month ago

Going to see if I can figure this out.

itamarst commented 1 month ago

Using a slightly different reproducer:

import polars as pl

df = (
    pl.DataFrame({"idx": [0, 1, 2], "x": [3, 4, 5]})
    .group_by_dynamic(
        "idx",
        every="1i",
        offset="0i",  # '1i' or '2i' has no effect on the window
        period="2i",
    )
    .agg(
        pl.when(pl.col("x").count() > 1)
        .then(pl.col("x").get(1))
        .otherwise(pl.col("x").get(0) + 100)
    )  # throws
)
print(df)

Here's the problem, from crates/polars-expr/src/expressions/ternary.rs:

impl PhysicalExpr for TernaryExpr {

    #[allow(clippy::ptr_arg)]
    fn evaluate_on_groups<'a>(
        &self,
        df: &DataFrame,
        groups: &'a GroupsProxy,
        state: &ExecutionState,
    ) -> PolarsResult<AggregationContext<'a>> {
        let op_mask = || self.predicate.evaluate_on_groups(df, groups, state);
        let op_truthy = || self.truthy.evaluate_on_groups(df, groups, state);
        let op_falsy = || self.falsy.evaluate_on_groups(df, groups, state);
        let (ac_mask, (ac_truthy, ac_falsy)) = if self.run_par {
            POOL.install(|| rayon::join(op_mask, || rayon::join(op_truthy, op_falsy)))
        } else {
            (op_mask(), (op_truthy(), op_falsy()))
        };
       // ...
}

Notice that the truthy and falsy expressions are called on every single group, regardless of the predicate.

The problem is more likely I guess on group_by_dynamic because you can get groups of different sizes. In my Python example above, the final group is length 1. So the .get(1) gets called on a list that is length 1.

So first, this leads to bugs. Second, this seems like a potential performance impact, because instead of just running a single expression per value, both the truthy and falsy expressions are run for every single value, and then half the results are just thrown out.

itamarst commented 1 month ago

evaluate() seems to have a similar issue.

itamarst commented 1 month ago

So one alternative implementation is wrapping truthy expression into "truthy(val) if predicate(val) else null" and equivalent for falsy expression. Not sure how this works when falsy is e.g. a literal, though (how would you know in advance? which is maybe why it's done the way it is).

Another option is somehow preserving the error so it's tied to a specific index, rather than a single error cancelling processing altogether (instead of Result<Vec<T>> use Vec<Result<T>>, except I'm not sure that actually possible in this case, given it's a Series not a Vec).

itamarst commented 1 month ago

Oh my reproducer was buggy, will fix in original comment. (And also upgrade to main). To clarify: I pasted the wrong script, you want .get(1) in the then().

And I can still reproduce with that on main.

itamarst commented 1 month ago

Ah, this behavior is actually documented: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.when.html#polars.when

So in some sense not a bug, but definitely a significant limitation given people keep hitting this.

dalejung commented 1 month ago

Just to clarify, you're talking about your use of when and the fact that it'll compute all expressions.

My original bug isn't documented behavior.

itamarst commented 1 month ago

I'll retake a look then and see if it's the same or not.

itamarst commented 1 month ago

@dalejung are you still seeing the original problem? I can't reproduce with any of your reproducer scripts (neither errors nor wrong results where num_get isn't the same as num_first and num_min). So if this is fixed maybe it should be closed and I can open a new issue for precalculation.

dalejung commented 1 week ago

@itamarst cannot replicate anymore. closing