pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.73k stars 17.95k forks source link

BUG: Partially incorrect results when using a custom indexer for a rolling window for max and min #46726

Open epigramx opened 2 years ago

epigramx commented 2 years ago

Pandas version checks

Reproducible Example

import pandas as pd
from pandas import api
import numpy as np

class MultiWindowIndexer(api.indexers.BaseIndexer):
    def __init__(self, window):
        self.window = np.array(window)
        super().__init__()

    def get_window_bounds(self, num_values, min_periods, center, closed):
        end = np.arange(num_values, dtype='int64') + 1
        start = np.clip(end - self.window, 0, num_values)
        return start, end

np.random.seed([3,14])
a = np.random.randn(20).cumsum()
w = np.minimum(
    np.random.randint(1, 4, size=a.shape),
    np.arange(len(a))+1
)

df = pd.DataFrame({'Data': a, 'Window': w})

df['max1'] = df.Data.rolling(MultiWindowIndexer(df.Window)).max(engine='cython')

print(df)

Issue Description

This method basically tries to use a rolling operation where the window is an arbitrary series of integers instead of an integer or an offset. It is related to question/feature request #46716 and it was originally authored as an answer for a StackOverflow question here. There the author of the method notes on the bug: "The cython implementation seems to remember the largest starting index encountered so far and 'clips' smaller starting indices to the stored value. More technically correct: only stores the range of the largest start and largest end indices encountered so far in a queue, discarding smaller start indices and making them unavailable."

Expected Behavior

The result printed for index 18, should be -1.487828 instead of -1.932612, because at that point the window is 3 and it looks for the max between -1.932612 and -2.539703 and -1.487828,

Installed Versions

commit : 4bfe3d07b4858144c219b9346329027024102ab6 python : 3.8.10.final.0 python-bits : 64 OS : Linux OS-release : 5.10.102.1-microsoft-standard-WSL2 Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : C.UTF-8 LOCALE : en_US.UTF-8 pandas : 1.4.2 numpy : 1.22.3 pytz : 2021.1 dateutil : 2.8.2 pip : 22.0.4 setuptools : 61.1.1 Cython : None pytest : 7.1.1 hypothesis : None sphinx : None blosc : 1.10.6 feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.1 IPython : None pandas_datareader: None bs4 : 4.10.0 bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None markupsafe : 2.0.1 matplotlib : None numba : None numexpr : 2.7.3 odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : 1.8.0 snappy : None sqlalchemy : 1.4.34 tables : 3.7.0 tabulate : 0.8.9 xarray : None xlrd : None xlwt : None zstandard : None
simonjayhawkins commented 2 years ago

Thanks @epigramx for the report.

Expected Behavior

The result printed for index 18, should be -1.487828 instead of -1.932612, because at that point the window is 3 and it looks for the max between -1.932612 and -2.539703 and -1.487828,

Note that on main this now raises ValueError: MultiWindowIndexer does not implement the correct signature for get_window_bounds

rhshadrach commented 2 years ago

@simonjayhawkins - on main, adding the (unused) argument step at the end of get_window_bounds gives the same behavior as reported in OP.

mroeschke commented 2 years ago

I think this bug will be specific to max & min since it doesn't use the traditional sliding window algorithm that most all the other aggregation functions use: https://stackoverflow.com/a/12239580

rhshadrach commented 2 years ago

@mroeschke - I haven't taken a look if the used algorithm can be adapted for arbitrary windows; if not, does it make sense to have two different algorithms (fastpath/slowpath)?

mroeschke commented 2 years ago

I am a little doubtful it can be sharable for other aggregations because IIUC the min/min window algorithm uses value comparisons since it's just looking for min/max

does it make sense to have two different algorithms (fastpath/slowpath)

I suppose so, but not too thrilled about maintaining heuristics when to use fast vs slow in addition to maintaining both algorithms.

We've had precedent for collapsing two different algorithms before trading off performance for the sake of correctness & maintainability, so if going back to the more correct algorithm doesn't incur that much of a performance hit I think that would be worthwhile