pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.41k stars 17.83k forks source link

ENH: Support rolling over one level of a MultiIndex #34642

Open miccoli opened 4 years ago

miccoli commented 4 years ago

I had a hard time understanding how df.rolling works when df is indexed by a MultiIndex

This is an example data frame:

import pandas as pd
idx = pd.MultiIndex.from_product(
    [pd.date_range("2020-01-01", "2020-1-10"), ["a", "b"]], names=["date", "obs"],
)
df = pd.DataFrame(index=idx)
df['c1'] = range(len(df))

print(df)

which outputs

                c1
date       obs    
2020-01-01 a     0
           b     1
2020-01-02 a     2
           b     3
2020-01-03 a     4
           b     5
2020-01-04 a     6
           b     7
2020-01-05 a     8
           b     9
2020-01-06 a    10
           b    11
2020-01-07 a    12
           b    13
2020-01-08 a    14
           b    15
2020-01-09 a    16
           b    17
2020-01-10 a    18
           b    19

Now I want to apply a rolling window on the date level, keeping the obs level separate.

I tried with no success obvious and simple (least surprise) commands like

but finally the desired result is obtained by

df_r = df.groupby(by="obs", group_keys=False).rolling(
    "7d", on=df.index.levels[0]
).mean().sort_index()

print(df_r)

which gives me the correct result:

                  c1
date       obs      
2020-01-01 a     0.0
           b     1.0
2020-01-02 a     1.0
           b     2.0
2020-01-03 a     2.0
           b     3.0
2020-01-04 a     3.0
           b     4.0
2020-01-05 a     4.0
           b     5.0
2020-01-06 a     5.0
           b     6.0
2020-01-07 a     6.0
           b     7.0
2020-01-08 a     8.0
           b     9.0
2020-01-09 a    10.0
           b    11.0
2020-01-10 a    12.0
           b    13.0

It seams to me that this should be a quite common situation, so I was wondering if there is a simpler way to obtain the same results. By the way my solution is not very robust, because there are hidden assumptions on how the objects returned by groupby are indexed, which do not necessarily hold for a generic data frame.

Moreover the doc of the on parameter in rolling was almost incomprehensible to me: I'm still wondering if my usage rolling( "7d", on=df.index.levels[0]) is the intended one or not.

daskol commented 4 years ago

Probably, there is a related issue #34617. By the ways, if data frame is indexed with integer index then one can use level keyword of .rolling(). This example is not consistent and just shows API features.

import pandas as pd
idx = pd.MultiIndex.from_product(
    [range(7), ["a", "b"]], names=["date", "obs"],
)
df = pd.DataFrame(index=idx)
df['c1'] = range(len(df))

print(df)

df_r = df.groupby(by="obs", group_keys=False).rolling(
    7, level=1
).mean().sort_index()

print(df_r)
miccoli commented 4 years ago

In @daskol example the level keyword seems a noop, as any other arg passed to rolling.

from pandas.testing import assert_frame_equal

assert_frame_equal(
    df.groupby(by="obs", group_keys=False).rolling(7).mean(),
    df.groupby(by="obs", group_keys=False).rolling(7, level=1).mean(),
)
assert_frame_equal(
    df.groupby(by="obs", group_keys=False).rolling(7).mean(),
    df.groupby(by="obs", group_keys=False).rolling(7, foo=8, bar='8').mean(),
)

I'm still very confused.

miccoli commented 4 years ago

A brief update. After updating pandas to v1.1.1 the above walk around fails with SEGFAULT!

>>> df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0])
RollingGroupby [window=604800000000000,min_periods=1,center=False,win_type=freq,axis=0,on=DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', name='date', freq=None)]

but

>>> df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0]).mean()
Segmentation fault: 11

It seems to me that this issue should be labeled as BUG

pandas version info ``` INSTALLED VERSIONS ------------------ commit : f2ca0a2665b2d169c97de87b8e778dbed86aea07 python : 3.8.2.final.0 python-bits : 64 OS : Darwin OS-release : 19.6.0 Version : Darwin Kernel Version 19.6.0: Thu Jun 18 20:49:00 PDT 2020; root:xnu-6153.141.1~1/RELEASE_X86_64 machine : x86_64 processor : i386 byteorder : little LC_ALL : None LANG : None LOCALE : None.UTF-8 pandas : 1.1.1 numpy : 1.19.1 pytz : 2020.1 dateutil : 2.8.1 pip : 20.2.2 setuptools : 49.6.0 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : 1.0.1 pymysql : None psycopg2 : None jinja2 : 2.11.2 IPython : 7.17.0 pandas_datareader: None bs4 : None bottleneck : None fsspec : 0.8.0 fastparquet : None gcsfs : None matplotlib : 3.3.1 numexpr : 2.7.1 odfpy : None openpyxl : 3.0.3 pandas_gbq : None pyarrow : 1.0.1 pytables : None pyxlsb : None s3fs : 0.5.0 scipy : 1.5.2 sqlalchemy : None tables : 3.6.1 tabulate : 0.8.7 xarray : 0.16.0 xlrd : 1.2.0 xlwt : None numba : 0.51.1 ```
miccoli commented 4 years ago

SEGFAULT possibly linked to #36018

drlakecoder commented 2 years ago

hi, I wonder if there were any update on this issue, in my situation, I am also unable to do rolling with non-integer windows when there is multi-index in the dataframe. With the simple example that was used in this thread, I would get different error like the following:

df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0]) Out[21]: RollingGroupby [window=7d,min_periods=1,center=False,win_type=freq,axis=0,on=DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08', '2020-01-09', '2020-01-10'], dtype='datetime64[ns]', name='date', freq='D'),method=single]

df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0]).mean() Traceback (most recent call last):

File "C:\Users\c_yy\AppData\Local\Temp/ipykernel_15476/3165508401.py", line 1, in df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0]).mean()

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 1855, in mean return super().mean(*args, engine=engine, engine_kwargs=engine_kwargs, **kwargs)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 1309, in mean return self._apply(window_func, name="mean", **kwargs)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 594, in _apply result = super()._apply(

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 545, in _apply return self._apply_blockwise(homogeneous_func, name)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 441, in _apply_blockwise new_mgr = mgr.apply(hfunc, ignore_failures=True)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 325, in apply applied = b.apply(f, **kwargs)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\internals\blocks.py", line 381, in apply result = func(self.values, **kwargs)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 431, in hfunc res_values = homogeneous_func(values)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 535, in homogeneous_func result = np.apply_along_axis(calc, self.axis, values)

File "<__array_function__ internals>", line 5, in apply_along_axis

File "C:\Users\c_yy\anaconda3\lib\site-packages\numpy\lib\shape_base.py", line 379, in apply_along_axis res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 521, in calc start, end = window_indexer.get_window_bounds(

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\indexers.py", line 333, in get_window_bounds index_array = self.index_array.take(ensure_platform_int(indices))

IndexError: index 10 is out of bounds for axis 0 with size 10

When using my own real data set (see below), I get different error:

rxx.loc['2015-12-01':'2017-12-29',:] TotRet SpecRet ... volumedays IsTrading date order_book_id ...
2015-12-01 000155.XSHE 0.014580 0.007206 ... 53.0 1.0 2015-12-02 000155.XSHE -0.004785 0.002132 ... 53.0 1.0 2015-12-03 000155.XSHE 0.034894 0.006537 ... 53.0 1.0 2015-12-04 000155.XSHE 0.050001 0.021692 ... 53.0 1.0 2015-12-07 000155.XSHE 0.049832 0.026999 ... 53.0 1.0 ... ... ... ... ... 2017-12-25 000155.XSHE -0.049707 -0.053313 ... 5.0 1.0 2017-12-26 000155.XSHE -0.023771 -0.013894 ... 6.0 1.0 2017-12-27 000155.XSHE 0.006497 0.014989 ... 7.0 1.0 2017-12-28 000155.XSHE -0.030648 -0.026356 ... 8.0 1.0 2017-12-29 000155.XSHE 0.003335 -0.005979 ... 9.0 1.0

[104 rows x 112 columns]

rxx.groupby(level='order_book_id',group_keys=False).rolling('63d',on=rxx.index.levels[0]).mean() *** ValueError: Length of values (567) does not match length of index (150)

pd.show_versions()

INSTALLED VERSIONS

commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5 python : 3.8.8.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.22000 machine : AMD64 processor : AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD byteorder : little LC_ALL : None LANG : en LOCALE : English_United States.1252

pandas : 1.3.4 numpy : 1.21.2 pytz : 2021.3 dateutil : 2.8.2 pip : 21.2.2 setuptools : 58.0.4 Cython : 0.29.24 pytest : 6.2.4 hypothesis : None sphinx : 4.2.0 blosc : None feather : None xlsxwriter : 3.0.1 lxml.etree : 4.6.3 html5lib : 1.1 pymysql : 1.0.2 psycopg2 : None jinja2 : 2.11.3 IPython : 7.29.0 pandas_datareader: None bs4 : 4.10.0 bottleneck : 1.3.2 fsspec : 2021.08.1 fastparquet : None gcsfs : None matplotlib : 3.4.3 numexpr : 2.7.3 odfpy : None openpyxl : 3.0.9 pandas_gbq : None pyarrow : 6.0.0 pyxlsb : None s3fs : None scipy : 1.7.1 sqlalchemy : 1.4.22 tables : 3.6.1 tabulate : 0.8.9 xarray : None xlrd : 2.0.1 xlwt : 1.3.0 numba : 0.53.1

jpodivin commented 2 years ago

I have to concur. This seems like a useful thing to have. Or if there is an alternative way to handle this use case it could be described in docs.

d-otto commented 1 year ago

I've run into this issue as well.

The documentation for df.rolling() states on= should be: "a column label or Index level on which to calculate the rolling window". My expectation was that I could pass the name of a multiindex level and .rolling() would group rows by unique index level values. This all might be better handled by .groupby(), but I'd love to see more integrated multiindex handling where convenient.

pollackscience commented 1 year ago

I have a similar issue, which again involves performing rolling actions on a dataframe with duplicate datetimes. I've gotten a complicated workaround using groupby and some other hacks, but a more native support would be greatly appreciated.

Kalaweksh commented 1 year ago

The groupby transform method has this exact functionality. The docs explain it pretty well. Although I agree that the windowing API should definitely natively have this functionality.

claudia-buss commented 1 year ago

This is the solution wanted https://www.appsloveworld.com/pandas/100/251/pandas-dataframe-rolling-window-with-multiindex

meta-ks commented 10 months ago

The doc still mentions this as funtctionality but it doesn't work:

src: Rolling doc

For a DataFrame, a column label or **Index level** on which to calculate the rolling window, rather than the DataFrame’s index.
jdkworld commented 9 months ago

The doc still mentions this as funtctionality but it doesn't work:

src: Rolling doc

For a DataFrame, a column label or **Index level** on which to calculate the rolling window, rather than the DataFrame’s index.

This is still an issue

zdwhite commented 4 months ago

The doc still mentions this as functionality but it doesn't work:

src: Rolling doc

For a DataFrame, a column label or **Index level** on which to calculate the rolling window, rather than the DataFrame’s index.

This is still an issue however there is a hacky(?) work around.

Consider the following multi indexed series

 test.info()
<class 'pandas.core.series.Series'>
MultiIndex: 5442161 entries, ('1A1', Timestamp('2024-04-07 17:00:00.431000+0000', tz='UTC')) to ('3B9', Timestamp('2024-04-> 28 16:59:58.468000+0000', tz='UTC'))
Series name: supplytemp
Non-Null Count    Dtype  
--------------    -----  
5442159 non-null  float64
dtypes: float64(1)

test.index.names
FrozenList(['site', 'TimeStamp'])

Test['rolling_avg_supply_temp'] = Test.groupby(level='site').rolling('1h',on=Test.index.get_level_values(level='TimeStamp'))['supplytemp'].mean()

While this is a tad more verbose then pandas just acknowledging the closest index this is a workable replacement. The descriptive statistics for the series line up with this method and the non indexed method.