Open jackaixin opened 4 months ago
Oh yeah, this is tricky 🤔. The "problem" only shows up at the start of the column, where the window is not yet full.
Another simple example:
import polars as pl

pl.DataFrame({"x": [1, 2, 3, 4]}).with_columns(
    without_weight=pl.col("x").rolling_mean(
        window_size=2,
        min_periods=0,
    ),
    with_weight_50_50=pl.col("x").rolling_mean(
        window_size=2,
        min_periods=0,
        weights=[0.5, 0.5],
    ),
    with_weight_20_80=pl.col("x").rolling_mean(
        window_size=2,
        min_periods=0,
        weights=[0.2, 0.8],
    ),
)
# shape: (4, 4)
# ┌─────┬────────────────┬───────────────────┬───────────────────┐
# │ x   ┆ without_weight ┆ with_weight_50_50 ┆ with_weight_20_80 │
# │ --- ┆ ---            ┆ ---               ┆ ---               │
# │ i64 ┆ f64            ┆ f64               ┆ f64               │
# ╞═════╪════════════════╪═══════════════════╪═══════════════════╡
# │ 1   ┆ 1.0            ┆ 0.5               ┆ 0.2               │ 🤔 window [1]
# │ 2   ┆ 1.5            ┆ 1.5               ┆ 1.8               │ ✅ window [1, 2]
# │ 3   ┆ 2.5            ┆ 2.5               ┆ 2.8               │ ✅
# │ 4   ┆ 3.5            ┆ 3.5               ┆ 3.8               │ ✅
# └─────┴────────────────┴───────────────────┴───────────────────┘
The first window is only [1].
Not sure if this is a bug, or intended behaviour that needs more documentation.
But one can surely argue that the 50/50 weighted mean of 1 and nothing else is still 1 and not 0.5.
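To make the "still 1 and not 0.5" argument concrete, here is a minimal pure-Python sketch of a rolling weighted mean that renormalizes the weights on partial windows. It is not how Polars implements anything; the function name `rolling_weighted_mean` is hypothetical, and aligning the weights to the end of the window (newest value gets the last weight) is an assumption.

```python
def rolling_weighted_mean(values, weights):
    """Rolling weighted mean that renormalizes weights on partial windows.

    Assumption: weights are aligned to the end of the window, so the
    newest value always gets weights[-1].
    """
    window_size = len(weights)
    out = []
    for i in range(len(values)):
        start = max(0, i - window_size + 1)
        window = values[start : i + 1]
        # Keep only the trailing weights that have a matching value.
        w = weights[window_size - len(window):]
        # Divide by the sum of the weights actually used, not all of them.
        out.append(sum(v * wi for v, wi in zip(window, w)) / sum(w))
    return out


x = [1, 2, 3, 4]
print(rolling_weighted_mean(x, [0.5, 0.5]))
print(rolling_weighted_mean(x, [0.2, 0.8]))
```

With this definition the first row comes out as 1.0 for both weightings (up to floating-point rounding), while the full windows give the same values as the table above.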
@Julian-J-S just wondering if this issue is going to be tackled soon. Thanks!
Hmm, looks like I ran into similar unexpected behaviour with rolling_sum: the first N-1 elements have incorrect values when specifying min_periods != N and passing an array of N weights.
Example:
import polars as pl

pl.DataFrame(
    {
        "original": [1., 0., 0.],
    }
).with_columns(
    pl.col("original").rolling_sum(
        min_periods=1,
        window_size=3,
        weights=[1, 2, 3],
    ).alias("rolling_weighted_sum"),
)
# shape: (3, 2)
# ┌──────────┬──────────────────────┐
# │ original ┆ rolling_weighted_sum │
# │ ---      ┆ ---                  │
# │ f64      ┆ f64                  │
# ╞══════════╪══════════════════════╡
# │ 1.0      ┆ 1.0                  │
# │ 0.0      ┆ 1.0                  │
# │ 0.0      ┆ 1.0                  │
# └──────────┴──────────────────────┘
My expected output would actually be 3, 2, 1 in this case, but I get 1, 1, 1, almost as if it is backward-filling nulls on the result of a min_periods=3 call.
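For reference, the expected 3, 2, 1 falls out if the weights are aligned to the end of each partial window. A minimal pure-Python sketch under that assumption (the helper `rolling_weighted_sum` is hypothetical and is not the Polars implementation):

```python
def rolling_weighted_sum(values, weights, min_periods=1):
    """Rolling weighted sum with weights aligned to the end of the window.

    Assumption: a partial window of length n uses the last n weights.
    """
    window_size = len(weights)
    out = []
    for i in range(len(values)):
        start = max(0, i - window_size + 1)
        window = values[start : i + 1]
        if len(window) < min_periods:
            # Not enough observations yet: emit a null.
            out.append(None)
            continue
        w = weights[window_size - len(window):]
        out.append(sum(v * wi for v, wi in zip(window, w)))
    return out


print(rolling_weighted_sum([1.0, 0.0, 0.0], [1, 2, 3]))  # [3.0, 2.0, 1.0]
```

The single 1.0 is weighted by 3 while it is the newest value, by 2 one step later, and by 1 once the window is full.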
Thanks for the report, I agree that this looks quite odd
@orlp I have some memory (but can't find the issue) of you saying that weights should be removed from the rolling functions completely - is that the case? Just to decide whether this should be addressed at all
(Think I found it with commenter: https://github.com/pola-rs/polars/issues/13966#issuecomment-1908997097)
I don't know if this is worth fixing since we'll soon be removing weights anyway.
Issue description
When min_periods is set to 0, the first few rolling windows have lengths 1, 2, ..., window_size-1. Currently, it seems that Polars is simply computing $\dfrac{s_1 w_1 + \dots + s_l w_l}{w_1 + w_2 + \dots + w_L}$, where $l=\text{length of window}$ and $L=\text{window size}$. (See output below.)
When the rolling window contains only the first row, i.e. 1, it's natural to expect the 'mean' of that rolling period to be 1, not 1/6, regardless of the weights.
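The formula above can be sketched in a few lines of plain Python. This is only a model of the behaviour described in this issue, not the actual Polars code; the function name `observed_rolling_mean`, the sample values, and the weights [1, 2, 3] (chosen because they sum to 6) are assumptions for illustration.

```python
def observed_rolling_mean(values, weights):
    """Model of the reported behaviour: numerator over the l values present,
    denominator over all L weights."""
    window_size = len(weights)
    total_weight = sum(weights)  # always w_1 + ... + w_L
    out = []
    for i in range(len(values)):
        start = max(0, i - window_size + 1)
        window = values[start : i + 1]
        # zip stops at the shorter sequence, so a partial window of
        # length l is paired with the first l weights.
        numerator = sum(v * w for v, w in zip(window, weights))
        out.append(numerator / total_weight)
    return out


# Hypothetical input: first window is [1], weights sum to 6.
print(observed_rolling_mean([1, 2, 3], [1, 2, 3])[0])  # roughly 1/6
```

The same model also gives 0.5 and 0.2 for the first row of the [0.5, 0.5] and [0.2, 0.8] examples reported in the comments on this issue.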
When the rolling window has $l < L$, it is still to be determined how we should assign the weights to the incomplete rolling window. For example, when the rolling window contains only [1, 2], we can assign weights [1, 2], or [2, 3]. But if we view this from a signal smoothing perspective, [2, 3] might be the better choice.
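To compare the two candidate assignments for the window [1, 2], here is a small sketch that renormalizes over whichever weights are used. The helper `partial_window_mean` and its `align` argument are hypothetical names, and the full weight vector [1, 2, 3] is assumed from the [1, 2] versus [2, 3] choice above.

```python
def partial_window_mean(window, weights, align):
    """Weighted mean of an incomplete window, renormalized over the
    weights that are actually used.

    align="head" uses the first len(window) weights,
    align="tail" uses the last len(window) weights.
    """
    n = len(window)
    if align == "head":
        w = weights[:n]
    else:
        w = weights[len(weights) - n:]
    return sum(v * wi for v, wi in zip(window, w)) / sum(w)


print(partial_window_mean([1, 2], [1, 2, 3], "head"))  # (1*1 + 2*2) / (1 + 2)
print(partial_window_mean([1, 2], [1, 2, 3], "tail"))  # (1*2 + 2*3) / (2 + 3)
```

Either way a single-element window returns the element itself; the two options only differ in how strongly the newest value is favoured while the window fills up.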
Also, it seems that when weights is not None, rolling_mean can't handle null values. This is already reported in https://github.com/pola-rs/polars/issues/13771, and I totally agree with @Sage0614 on his implementation. (Having something similar to how ewm_mean handles nulls would be great.) Hopefully that issue can be resolved together with this one.