Open miccoli opened 4 years ago
Probably, there is a related issue #34617. By the ways, if data frame is indexed with integer index then one can use level
keyword of .rolling()
. This example is not consistent and just shows API features.
import pandas as pd
idx = pd.MultiIndex.from_product(
[range(7), ["a", "b"]], names=["date", "obs"],
)
df = pd.DataFrame(index=idx)
df['c1'] = range(len(df))
print(df)
df_r = df.groupby(by="obs", group_keys=False).rolling(
7, level=1
).mean().sort_index()
print(df_r)
In @daskol example the level
keyword seems a noop, as any other arg passed to rolling
.
from pandas.testing import assert_frame_equal
assert_frame_equal(
df.groupby(by="obs", group_keys=False).rolling(7).mean(),
df.groupby(by="obs", group_keys=False).rolling(7, level=1).mean(),
)
assert_frame_equal(
df.groupby(by="obs", group_keys=False).rolling(7).mean(),
df.groupby(by="obs", group_keys=False).rolling(7, foo=8, bar='8').mean(),
)
I'm still very confused.
A brief update. After updating pandas to v1.1.1 the above walk around fails with SEGFAULT!
>>> df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0])
RollingGroupby [window=604800000000000,min_periods=1,center=False,win_type=freq,axis=0,on=DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
'2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
'2020-01-09', '2020-01-10'],
dtype='datetime64[ns]', name='date', freq=None)]
but
>>> df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0]).mean()
Segmentation fault: 11
It seems to me that this issue should be labeled as BUG
SEGFAULT possibly linked to #36018
hi, I wonder if there were any update on this issue, in my situation, I am also unable to do rolling with non-integer windows when there is multi-index in the dataframe. With the simple example that was used in this thread, I would get different error like the following:
df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0]) Out[21]: RollingGroupby [window=7d,min_periods=1,center=False,win_type=freq,axis=0,on=DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08', '2020-01-09', '2020-01-10'], dtype='datetime64[ns]', name='date', freq='D'),method=single]
df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0]).mean() Traceback (most recent call last):
File "C:\Users\c_yy\AppData\Local\Temp/ipykernel_15476/3165508401.py", line 1, in
File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 1855, in mean return super().mean(*args, engine=engine, engine_kwargs=engine_kwargs, **kwargs)
File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 1309, in mean return self._apply(window_func, name="mean", **kwargs)
File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 594, in _apply result = super()._apply(
File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 545, in _apply return self._apply_blockwise(homogeneous_func, name)
File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 441, in _apply_blockwise new_mgr = mgr.apply(hfunc, ignore_failures=True)
File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 325, in apply applied = b.apply(f, **kwargs)
File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\internals\blocks.py", line 381, in apply result = func(self.values, **kwargs)
File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 431, in hfunc res_values = homogeneous_func(values)
File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 535, in homogeneous_func result = np.apply_along_axis(calc, self.axis, values)
File "<__array_function__ internals>", line 5, in apply_along_axis
File "C:\Users\c_yy\anaconda3\lib\site-packages\numpy\lib\shape_base.py", line 379, in apply_along_axis res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 521, in calc start, end = window_indexer.get_window_bounds(
File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\indexers.py", line 333, in get_window_bounds index_array = self.index_array.take(ensure_platform_int(indices))
IndexError: index 10 is out of bounds for axis 0 with size 10
When using my own real data set (see below), I get different error:
rxx.loc['2015-12-01':'2017-12-29',:]
TotRet SpecRet ... volumedays IsTrading
date order_book_id ...
2015-12-01 000155.XSHE 0.014580 0.007206 ... 53.0 1.0
2015-12-02 000155.XSHE -0.004785 0.002132 ... 53.0 1.0
2015-12-03 000155.XSHE 0.034894 0.006537 ... 53.0 1.0
2015-12-04 000155.XSHE 0.050001 0.021692 ... 53.0 1.0
2015-12-07 000155.XSHE 0.049832 0.026999 ... 53.0 1.0
... ... ... ... ...
2017-12-25 000155.XSHE -0.049707 -0.053313 ... 5.0 1.0
2017-12-26 000155.XSHE -0.023771 -0.013894 ... 6.0 1.0
2017-12-27 000155.XSHE 0.006497 0.014989 ... 7.0 1.0
2017-12-28 000155.XSHE -0.030648 -0.026356 ... 8.0 1.0
2017-12-29 000155.XSHE 0.003335 -0.005979 ... 9.0 1.0
[104 rows x 112 columns]
rxx.groupby(level='order_book_id',group_keys=False).rolling('63d',on=rxx.index.levels[0]).mean() *** ValueError: Length of values (567) does not match length of index (150)
pd.show_versions()
commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5 python : 3.8.8.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.22000 machine : AMD64 processor : AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD byteorder : little LC_ALL : None LANG : en LOCALE : English_United States.1252
pandas : 1.3.4 numpy : 1.21.2 pytz : 2021.3 dateutil : 2.8.2 pip : 21.2.2 setuptools : 58.0.4 Cython : 0.29.24 pytest : 6.2.4 hypothesis : None sphinx : 4.2.0 blosc : None feather : None xlsxwriter : 3.0.1 lxml.etree : 4.6.3 html5lib : 1.1 pymysql : 1.0.2 psycopg2 : None jinja2 : 2.11.3 IPython : 7.29.0 pandas_datareader: None bs4 : 4.10.0 bottleneck : 1.3.2 fsspec : 2021.08.1 fastparquet : None gcsfs : None matplotlib : 3.4.3 numexpr : 2.7.3 odfpy : None openpyxl : 3.0.9 pandas_gbq : None pyarrow : 6.0.0 pyxlsb : None s3fs : None scipy : 1.7.1 sqlalchemy : 1.4.22 tables : 3.6.1 tabulate : 0.8.9 xarray : None xlrd : 2.0.1 xlwt : 1.3.0 numba : 0.53.1
I have to concur. This seems like a useful thing to have. Or if there is an alternative way to handle this use case it could be described in docs.
I've run into this issue as well.
The documentation for df.rolling()
states on=
should be: "a column label or Index level on which to calculate the rolling window". My expectation was that I could pass the name of a multiindex level and .rolling()
would group rows by unique index level values. This all might be better handled by .groupby()
, but I'd love to see more integrated multiindex handling where convenient.
I have a similar issue, which again involves performing rolling actions on a dataframe with duplicate datetimes. I've gotten a complicated workaround using groupby and some other hacks, but a more native support would be greatly appreciated.
The groupby transform method has this exact functionality. The docs explain it pretty well. Although I agree that the windowing API should definitely natively have this functionality.
This is the solution wanted https://www.appsloveworld.com/pandas/100/251/pandas-dataframe-rolling-window-with-multiindex
The doc still mentions this as funtctionality but it doesn't work:
src: Rolling doc
For a DataFrame, a column label or **Index level** on which to calculate the rolling window, rather than the DataFrame’s index.
The doc still mentions this as funtctionality but it doesn't work:
src: Rolling doc
For a DataFrame, a column label or **Index level** on which to calculate the rolling window, rather than the DataFrame’s index.
This is still an issue
The doc still mentions this as functionality but it doesn't work:
src: Rolling doc
For a DataFrame, a column label or **Index level** on which to calculate the rolling window, rather than the DataFrame’s index.
This is still an issue however there is a hacky(?) work around.
Consider the following multi indexed series
test.info() <class 'pandas.core.series.Series'> MultiIndex: 5442161 entries, ('1A1', Timestamp('2024-04-07 17:00:00.431000+0000', tz='UTC')) to ('3B9', Timestamp('2024-04-> 28 16:59:58.468000+0000', tz='UTC')) Series name: supplytemp Non-Null Count Dtype -------------- ----- 5442159 non-null float64 dtypes: float64(1) test.index.names FrozenList(['site', 'TimeStamp']) Test['rolling_avg_supply_temp'] = Test.groupby(level='site').rolling('1h',on=Test.index.get_level_values(level='TimeStamp'))['supplytemp'].mean()
While this is a tad more verbose then pandas just acknowledging the closest index this is a workable replacement. The descriptive statistics for the series line up with this method and the non indexed method.
[x] I have searched the [pandas] tag on StackOverflow for similar questions.
[ ] I have asked my usage related question on StackOverflow.
I had a hard time understanding how
df.rolling
works whendf
is indexed by a MultiIndexThis is an example data frame:
which outputs
Now I want to apply a rolling window on the
date
level, keeping theobs
level separate.I tried with no success obvious and simple (least surprise) commands like
df.rolling("7d", index="date")
ordf.rolling("7d", on="date")
but finally the desired result is obtained by
which gives me the correct result:
It seams to me that this should be a quite common situation, so I was wondering if there is a simpler way to obtain the same results. By the way my solution is not very robust, because there are hidden assumptions on how the objects returned by
groupby
are indexed, which do not necessarily hold for a generic data frame.Moreover the doc of the
on
parameter inrolling
was almost incomprehensible to me: I'm still wondering if my usagerolling( "7d", on=df.index.levels[0])
is the intended one or not.