pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.28k stars 17.8k forks source link

as_index = False not working for applying groupby rolling agg to DataFrame #31007

Open hsorsky opened 4 years ago

hsorsky commented 4 years ago

Problem description

When using a rolling agg function on a groupby object, we cannot ommit the groupby columns from the resulting dataframe's index using as_index = False, like we can when applying a non-rolling agg function.

Code Sample

import pandas as pd

data = {
    'groupby_col': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', ],
    'agg_col': [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
}
df = pd.DataFrame(data)
df.groupby(['groupby_col'], as_index=False).rolling(4).agg({'agg_col': 'mean'})

Expected Output

pd.DataFrame({'agg_col': {
  0: np.nan,
  1: np.nan,
  2: np.nan,
  3: 0.75,
  4: 0.5,
  5: np.nan,
  6: np.nan,
  7: np.nan,
  8: 0.25,
  9: 0.25}})
agg_col
0 NaN
1 NaN
2 NaN
3 0.75
4 0.50
5 NaN
6 NaN
7 NaN
8 0.25
9 0.25

Actual Output

pd.DataFrame({'agg_col': {
  (0, 0): np.nan,
  (0, 1): np.nan,
  (0, 2): np.nan,
  (0, 3): 0.75,
  (0, 4): 0.5,
  (1, 5): np.nan,
  (1, 6): np.nan,
  (1, 7): np.nan,
  (1, 8): 0.25,
  (1, 9): 0.25}})
agg_col
0 0 NaN
1 NaN
2 NaN
3 0.75
4 0.50
1 5 NaN
6 NaN
7 NaN
8 0.25
9 0.25

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.7.3.final.0 python-bits : 64 OS : Darwin OS-release : 18.7.0 machine : x86_64 processor : i386 byteorder : little LC_ALL : None LANG : None LOCALE : en_US.UTF-8 pandas : 0.25.1 numpy : 1.17.4 pytz : 2019.3 dateutil : 2.8.0 pip : 19.3.1 setuptools : 40.8.0 Cython : 0.29.14 pytest : 5.3.2 hypothesis : None sphinx : 2.3.0 blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : 1.0.1 pymysql : None psycopg2 : 2.8.4 (dt dec pq3 ext lo64) jinja2 : 2.10.3 IPython : 7.10.2 pandas_datareader: None bs4 : 4.8.1 bottleneck : None fastparquet : None gcsfs : None lxml.etree : None matplotlib : 3.1.2 numexpr : 2.7.1 odfpy : None openpyxl : None pandas_gbq : None pyarrow : 0.15.1 pytables : None s3fs : None scipy : 1.3.0 sqlalchemy : 1.3.11 tables : 3.6.1 xarray : None xlrd : None xlwt : None xlsxwriter : None
simonjayhawkins commented 4 years ago

@hsorsky thanks for the report.

This looks similar to #32240

AFAICT, the returned index is probably correct, but the original groupby_col should not be dropped from the results with as_index=False

phofl commented 4 years ago

@simonjayhawkins This has actually a different origin. The rolling function transforms the DataFrameGroupBy into a RollingGroupBy object, which has a completly different aggregate function. The RollingGroupBy object stores the as_index flag under obj._groupby.as_index. The flag is never accessed. So the code execution is independent of this flag when using rolling with aggregate.

Edit: That was a bit misleading. The actual aggregeation is done the same way. But the process steps after the aggregation are completly different. The rolling part runs into pandas/core/window/rolling.py line 603

rhshadrach commented 1 year ago

I'm not sure what the right output here is. On the one hand, doing .groupby(...).rolling(...).agg(...) is a transformation (always has exactly one row per input row). Should we adhere to the semantics of a transformation? That would mean this operation should ignore as_index and the current index is incorrect (it should have the same index as the input DataFrame).

On other other hand, users often seem confused that as_index is ignored for transforms. I personally think we should expand on the options where the groups are in the result (or not). This is #49543.

ihsansecer commented 1 year ago

I agree. It is more sensible to ignore as_index and return index as is for rolling, ewm and expanding transformations. At least until a decision is made for #49543 I guess? But right now only agg(dict_like) and agg(list_like) behave as such. .agg(string), .agg(callable) or functions such as .mean, .sum alter index even though all are doing a transformation. #54973 is just fixing this inconsistency

rhshadrach commented 1 year ago

I agree. It is more sensible to ignore as_index and return index as is for rolling, ewm and expanding transformations.

https://github.com/pandas-dev/pandas/pull/54973 is just fixing this inconsistency

Agreed that fixing the inconsistency is positive, but I think we want to avoid changing behavior on users as a bugfix and then changing it again as a bugfix if we can just do it all at once.

ihsansecer commented 1 year ago

Fair enough. I can make changes so that all rolling, ewm and expanding behaves like other transformations does or shall I left this to be implemented along with changes proposed in #49543?