narwhals-dev / narwhals

Lightweight and extensible compatibility layer between dataframe libraries!
https://narwhals-dev.github.io/narwhals/
MIT License
607 stars 90 forks source link

feat: add `Series|Expr.rolling_sum` method #1395

Closed FBruzzesi closed 4 days ago

FBruzzesi commented 5 days ago

What type of PR is this? (check all applicable)

Related issues

Checklist

If you have comments or can explain your changes, please do so below

So I wanted to start with sum, assuming it would have been simpler for arrow to implement in a way which is not naive.

Running some benchmarks, performances of this with 1M rows is in the same ballpark of pandas, while the naive way in #1290 is orders of magnitude slower. I would consider this the way to go forward.

FBruzzesi commented 4 days ago

wow, this is clever! do you have an idea for how to do the mean / min / max cases?

MarcoGorelli commented 4 days ago

Love the creativity here

I tried running this hypothesis test

from hypothesis import given
import hypothesis.strategies as st
import pyarrow as pa
import pandas as pd

@given(
    center = st.booleans(),
    values = st.lists(st.floats(-10, 10), min_size=3, max_size=10),
)
@pytest.mark.filterwarnings('ignore:.*:narwhals.exceptions.NarwhalsUnstableWarning')
def test_rolling_sum_hypothesis(center: bool, values: list[float]) -> None:
    s = pd.Series(values)
    n_missing = random.randint(0, len(s)-1)
    window_size = random.randint(1, len(s))
    min_periods = random.randint(0, window_size)
    mask = random.sample(range(len(s)), n_missing)
    s[mask] = None
    df = pd.DataFrame({'a': s})
    expected = s.rolling(window=window_size, center=center, min_periods=min_periods).sum().to_frame('a')
    result = nw.from_native(pa.Table.from_pandas(df)).select(nw.col('a').rolling_sum(window_size, center=center, min_periods=min_periods))
    expected_dict = nw.from_native(expected, eager_only=True).to_dict(as_series=False)
    assert_equal_data(result, expected_dict)

and it's picking up some small inconsistencies:

In [19]: s
Out[19]: 
0    0.0
1    NaN
2    0.0
dtype: float64

In [20]: s.rolling(min_periods=0, center=False, window=2).sum()
Out[20]: 
0    0.0
1    0.0
2    0.0
dtype: float64

In [21]: nw.from_native(pa.chunked_array([s]), series_only=True).rolling_sum(min_periods=0, center=False, window_size=2).to_native()
Out[21]: 
<pyarrow.lib.ChunkedArray object at 0x7f7cd5793ee0>
[
  [
    null,
    null,
    null
  ]
]

In [22]: nw.from_native(pl.from_pandas(s), series_only=True).rolling_sum(min_periods=0, center=False, window_size=2).to_native()
Out[22]: 
shape: (3,)
Series: '' [f64]
[
        0.0
        0.0
        0.0
]
FBruzzesi commented 4 days ago

I tried running this hypothesis test

and it's picking up some small inconsistencies

Well ok min_periods = min_periods or window_size evaluates to 2, because of 0 or 2. Let me adjust

FBruzzesi commented 4 days ago

I would tend towards restricting the API to:

and raise in other cases.

Edit: old polars seems to break if that's not the case.

MarcoGorelli commented 4 days ago

thanks - shall we also include the hypothesis test?