pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.23k stars 1.95k forks source link

Rolling_mean with weights gives wrong result when min_periods is not None #17344

Open jackaixin opened 4 months ago

jackaixin commented 4 months ago

Checks

Reproducible example

# no exception but gives unexpected results

out = pl.DataFrame({'a':[1,2,3,4,5]}).with_columns(
    pl.col('a').rolling_mean(3, min_periods=0, weights=[1, 2, 3]).alias('rolling_mean_3')
)

print(out)
try:
    pl.DataFrame({'a':[1,2,None,4,5]}).with_columns(
        pl.col('a').rolling_mean(3, min_periods=1, weights=[1, 2, 3]).alias('rolling_mean_3')
    )
except pl.exceptions.PanicException as e:
    print(e)

Log output

weights not yet supported on array with null values

Issue description

When min_periods is set to 0, the first few rolling windows would have length of 1, 2, ..., window_size-1. Currently, it seems that Polars is simply doing $\dfrac{s_1 w_1 + ... + s_l w_l}{w_1 + w_2 + ... w_L}$, where $l=\text{length of window}$, and $L=\text{window size}$. (See output below.)

shape: (5, 2)
┌─────┬────────────────┐
│ a   ┆ rolling_mean_3 │
│ --- ┆ ---            │
│ i64 ┆ f64            │
╞═════╪════════════════╡
│ 1   ┆ 0.166667       │
│ 2   ┆ 0.833333       │
│ 3   ┆ 2.333333       │
│ 4   ┆ 3.333333       │
│ 5   ┆ 4.333333       │
└─────┴────────────────┘

When the rolling window has only the first row, i.e. 1, it's natural to expect that the 'mean' of that rolling period is 1, not 1/6, regardless of the weights.

When the rolling window has $l < L$, it is still to be determined how we should assign the weights to the incomplete rolling window. For example, when the rolling window contains only [1, 2], we can assign weights [1, 2], or [2, 3]. But if we view this from a signal smoothing perspective, [2, 3] might be the better choice.

Also, it seems that when weights is not None, rolling_mean can't handle null values. This is already reported in https://github.com/pola-rs/polars/issues/13771, and I totally agree with @Sage0614 on his implementation. (Having something similar to how ewm_mean handles nulls would be great.) Hopefully that issue could be resolved together with this current issue.

Expected behavior

shape: (5, 2)
┌─────┬────────────────┐
│ a   ┆ rolling_mean_3 │
│ --- ┆ ---            │
│ i64 ┆ f64            │
╞═════╪════════════════╡
│ 1   ┆ 1.0.           │
│ 2   ┆ 1.6.           │
│ 3   ┆ 2.333333       │
│ 4   ┆ 3.333333       │
│ 5   ┆ 4.333333       │
└─────┴────────────────┘

Installed versions

``` --------Version info--------- Polars: 1.0.0 Index type: UInt32 Platform: macOS-14.5-arm64-arm-64bit Python: 3.11.9 (main, Apr 2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)] ----Optional dependencies---- adbc_driver_manager: cloudpickle: connectorx: deltalake: fastexcel: fsspec: gevent: great_tables: hvplot: matplotlib: 3.8.2 nest_asyncio: 1.6.0 numpy: 1.26.2 openpyxl: pandas: 2.1.4 pyarrow: 15.0.0 pydantic: pyiceberg: sqlalchemy: torch: xlsx2csv: xlsxwriter: ```
Julian-J-S commented 4 months ago

oh yeah, this is tricky 🤔. The "problem" is only at the start of the window

another easy example:

pl.DataFrame({"x": [1, 2, 3, 4]}).with_columns(
    without_weight=pl.col("x").rolling_mean(
        window_size=2,
        min_periods=0,
    ),
    with_weight_50_50=pl.col("x").rolling_mean(
        window_size=2,
        min_periods=0,
        weights=[0.5, 0.5],
    ),
    with_weight_20_80=pl.col("x").rolling_mean(
        window_size=2,
        min_periods=0,
        weights=[0.2, 0.8],
    ),
)

# shape: (4, 4)
# ┌─────┬────────────────┬───────────────────┬───────────────────┐
# │ x   ┆ without_weight ┆ with_weight_50_50 ┆ with_weight_20_80 │
# │ --- ┆ ---            ┆ ---               ┆ ---               │
# │ i64 ┆ f64            ┆ f64               ┆ f64               │
# ╞═════╪════════════════╪═══════════════════╪═══════════════════╡
# │ 1   ┆ 1.0            ┆ 0.5               ┆ 0.2               │ 🤔  window [1]
# │ 2   ┆ 1.5            ┆ 1.5               ┆ 1.8               │ ✅  window [1, 2]
# │ 3   ┆ 2.5            ┆ 2.5               ┆ 2.8               │ ✅ 
# │ 4   ┆ 3.5            ┆ 3.5               ┆ 3.8               │ ✅ 
# └─────┴────────────────┴───────────────────┴───────────────────┘

First window is only [1]

Not sure if this is a bug or intended behaviour and requires more documentation. But one can surely argue that the 50/50 weighted mean of 1 and nothing else is still 1 and not 0.5

jackaixin commented 2 months ago

@Julian-J-S just wondering if this issue is going to be tackled soon. Thanks!

pniskala commented 1 week ago

Hmm, looks like I encountered similar unexpected behaviour for rolling_sum when where the first N-1 elements have incorrect values when specifying min_periods != N and using a weight array of N weights.

Example:

import polars as pl

pl.DataFrame(
    {
        "original": [1., 0., 0.],
    }
).with_columns(
    pl.col("original").rolling_sum(
        min_periods=1,
        window_size=3,
        weights=[1, 2, 3],
    ).alias("rolling_weighted_sum"),
)

# shape: (3, 2)
# ┌──────────┬──────────────────────┐
# │ original ┆ rolling_weighted_sum │
# │ ---      ┆ ---                  │
# │ f64      ┆ f64                  │
# ╞══════════╪══════════════════════╡
# │ 1.0      ┆ 1.0                  │
# │ 0.0      ┆ 1.0                  │
# │ 0.0      ┆ 1.0                  │
# └──────────┴──────────────────────┘

My expected output would be actually 3, 2, 1 in this case but I get 1, 1, 1, almost as if it is doing backwards filling of nulls on the result of min_periods=3 call.

MarcoGorelli commented 2 days ago

Thanks for the report, I agree that this looks quite odd

MarcoGorelli commented 2 days ago

@orlp I have some memory (but can't find the issue) of you saying that weights should be removed from the rolling functions completely - is that the case? Just to decide whether this should be addressed at all

cmdlineluser commented 2 days ago

(Think I found it with commenter:^1)

https://github.com/pola-rs/polars/issues/13966#issuecomment-1908997097

I don't know if this is worth fixing since we'll soon be removing weights anyway.

Sage0614 commented 2 days ago

@MarcoGorelli as @jackaixin mensioned, I proposed a way to efficiently do general rolling statistics with weight #13771 I don't see a reason why rolling function can not have weight