pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Raw `pd.Rolling.apply` with index #44342

Open erezinman opened 2 years ago

erezinman commented 2 years ago

In many cases (especially with non-uniformly indexed time series), one needs to know the index of the series during a rolling operation. While the non-raw mode provides both the index and the values, it cannot be combined with the optimizations of the "numba" engine.
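For reference, a minimal sketch of the non-raw mode mentioned above, where each window arrives as a Series and so carries its own index. The sample data is made up for illustration, and the time span is measured in seconds rather than nanoseconds:

```python
import numpy as np
import pandas as pd


def normalized_ptp(window):
    # With raw=False each window is a pandas Series, so the index is available.
    elapsed = (window.index.max() - window.index.min()).total_seconds()
    if elapsed == 0:
        # A single-point window has no time span to normalize by.
        return np.nan
    return np.ptp(window.to_numpy()) / elapsed


# Illustrative, non-uniformly spaced series (hypothetical data).
index = pd.to_datetime(
    [
        "2021-01-01 00:00:00",
        "2021-01-01 00:00:10",
        "2021-01-01 00:00:45",
        "2021-01-01 00:02:00",
    ]
)
time_series = pd.Series([1.0, 3.0, 2.0, 5.0], index=index)

# Runs on the default engine; engine="numba" requires raw=True.
result = time_series.rolling("1min").apply(normalized_ptp, raw=False)
```

This works, but each window is materialized as a Series, which is the overhead the request is trying to avoid.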

I propose a new mode for the raw argument (say df.rolling('1min').apply(..., raw='with_index', engine="numba")) in which the function receives the index as a second argument. For example:

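import numpy as np

def normalized_ptp(values, index_values):
    # Peak-to-peak of the values divided by the elapsed time in nanoseconds.
    return np.ptp(values) / np.ptp(index_values.view('int64'))

# raw='with_index' is the proposed mode; it does not exist in pandas today.
time_series.rolling('1min').apply(normalized_ptp,
                                  raw='with_index',
                                  engine='numba',
                                  engine_kwargs=dict(nopython=True))

jreback commented 2 years ago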

Hmm, I thought we did this already, or maybe that's in table mode.

cc @mroeschke

mroeschke commented 2 years ago

This is already the case in groupby.agg/transform with the numba engine but not possible with rolling.apply currently.

As a workaround, you can bring the index into the DataFrame as a column and pass method="table" to rolling (only implemented for the numba engine), so the function can read the index from one column of the 2D numpy array.
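A rough sketch of that workaround, assuming numba is installed. The helper names `normalized_ptp_kernel` and `rolling_normalized_ptp` and the "value"/"seconds" column names are made up for illustration. The index is stored as elapsed seconds because table mode works on a float64 array:

```python
import numpy as np
import pandas as pd


def normalized_ptp_kernel(window):
    # window is a 2D float64 array: column 0 = values, column 1 = elapsed seconds.
    # Table mode expects one output per column; only column 0 is meaningful here.
    out = np.full(window.shape[1], np.nan)
    values = window[:, 0]
    seconds = window[:, 1]
    span = seconds.max() - seconds.min()
    if span > 0:
        out[0] = (values.max() - values.min()) / span
    return out


def rolling_normalized_ptp(series, window="1min"):
    # Hypothetical helper: copy the index into a column, then roll in table mode.
    elapsed = np.asarray((series.index - series.index[0]).total_seconds())
    frame = pd.DataFrame(
        {"value": series.to_numpy(dtype="float64"), "seconds": elapsed},
        index=series.index,
    )
    result = frame.rolling(window, method="table").apply(
        normalized_ptp_kernel, raw=True, engine="numba"
    )
    # The "seconds" column of the result is filler and can be discarded.
    return result["value"]
```

Calling `rolling_normalized_ptp(time_series)` on a Series with a DatetimeIndex should return one normalized peak-to-peak per row, with NaN wherever the window spans zero time.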

Not too thrilled about raw gaining another accepted value, as it's also relevant to the cython engine.