Closed FBruzzesi closed 4 days ago
wow, this is clever! do you have an idea for how to do the mean / min / max cases?
Love the creativity here
I tried running this hypothesis test:
import random

import hypothesis.strategies as st
import pandas as pd
import pyarrow as pa
import pytest
from hypothesis import given

import narwhals as nw
from tests.utils import assert_equal_data  # narwhals test-suite helper


@given(
    center=st.booleans(),
    values=st.lists(st.floats(-10, 10), min_size=3, max_size=10),
)
@pytest.mark.filterwarnings('ignore:.*:narwhals.exceptions.NarwhalsUnstableWarning')
def test_rolling_sum_hypothesis(center: bool, values: list[float]) -> None:
    s = pd.Series(values)
    n_missing = random.randint(0, len(s) - 1)
    window_size = random.randint(1, len(s))
    min_periods = random.randint(0, window_size)
    mask = random.sample(range(len(s)), n_missing)
    s[mask] = None
    df = pd.DataFrame({'a': s})
    expected = s.rolling(window=window_size, center=center, min_periods=min_periods).sum().to_frame('a')
    result = nw.from_native(pa.Table.from_pandas(df)).select(
        nw.col('a').rolling_sum(window_size, center=center, min_periods=min_periods)
    )
    expected_dict = nw.from_native(expected, eager_only=True).to_dict(as_series=False)
    assert_equal_data(result, expected_dict)
and it's picking up some small inconsistencies:
In [19]: s
Out[19]:
0    0.0
1    NaN
2    0.0
dtype: float64
In [20]: s.rolling(min_periods=0, center=False, window=2).sum()
Out[20]:
0    0.0
1    0.0
2    0.0
dtype: float64
In [21]: nw.from_native(pa.chunked_array([s]), series_only=True).rolling_sum(min_periods=0, center=False, window_size=2).to_native()
Out[21]:
<pyarrow.lib.ChunkedArray object at 0x7f7cd5793ee0>
[
  [
    null,
    null,
    null
  ]
]
In [22]: nw.from_native(pl.from_pandas(s), series_only=True).rolling_sum(min_periods=0, center=False, window_size=2).to_native()
Out[22]:
shape: (3,)
Series: '' [f64]
[
0.0
0.0
0.0
]
Well ok, min_periods = min_periods or window_size evaluates to 2, because 0 or 2 evaluates to 2. Let me adjust.
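For context, a small sketch of that falsy-value pitfall and the usual fix (illustrative only, not the PR's actual code):
window_size = 2

# Using `or` to default min_periods treats 0 the same as None:
min_periods = 0
min_periods = min_periods or window_size  # -> 2, silently discarding the 0

# Only substituting when the value is actually None keeps 0 intact:
min_periods = 0
min_periods = window_size if min_periods is None else min_periods  # -> 0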
I would tend towards restricting the API to:
- window_size: strictly positive
- min_periods: either None or strictly positive
and raise in other cases.
Edit: old polars seems to break if that's not the case.
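A minimal sketch of that restriction, assuming a hypothetical validation helper (the name and placement are illustrative, not the actual narwhals code):
def validate_rolling_args(window_size: int, min_periods: int | None) -> int:
    # window_size must be strictly positive
    if window_size < 1:
        raise ValueError(f"window_size must be positive, got {window_size}")
    # min_periods is either None (defaulting to window_size) or strictly positive
    if min_periods is None:
        return window_size
    if min_periods < 1:
        raise ValueError(f"min_periods must be None or positive, got {min_periods}")
    return min_periods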
thanks - shall we also include the hypothesis test?
If you have comments or can explain your changes, please do so below
So I wanted to start with sum, assuming it would have been simpler for arrow to implement in a way which is not naive. Running some benchmarks, performance of this with 1M rows is in the same ballpark as pandas, while the naive way in #1290 is orders of magnitude slower. I would consider this the way to go forward.
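For illustration, a minimal sketch of the non-naive idea as I read it (the function name and details are mine, not the PR's code): a rolling sum is a difference of prefix sums, so it can be built from pyarrow's cumulative_sum plus a shift instead of a Python-level loop over windows. Null handling, min_periods, and centering are ignored here.
import pyarrow as pa
import pyarrow.compute as pc

def rolling_sum_sketch(values: list[float], window_size: int) -> pa.Array:
    # assumes 1 <= window_size <= len(values)
    arr = pa.array(values, type=pa.float64())
    prefix = pc.cumulative_sum(arr)  # prefix[i] = x[0] + ... + x[i]
    # prefix shifted right by window_size, zero-padded at the front
    shifted = pa.concat_arrays([
        pa.array([0.0] * window_size, type=pa.float64()),
        prefix.slice(0, len(arr) - window_size),
    ])
    # rolling_sum[i] = prefix[i] - prefix[i - window_size];
    # the first window_size - 1 entries come out as partial sums, which the
    # real implementation masks according to min_periods.
    return pc.subtract(prefix, shifted)

# rolling_sum_sketch([1.0, 2.0, 3.0, 4.0], 2) -> [1.0, 3.0, 5.0, 7.0]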