pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

df.groupby(df.index) drops index info when df has a MultiIndex #24786

Open rben01 opened 5 years ago

rben01 commented 5 years ago

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> df = pd.DataFrame(
...     {'a': list(range(10))},
...     index=pd.MultiIndex.from_arrays(
...         [[0,1,0,1,0,1,0,1,0,1], [0,1,2,0,1,2,0,1,2,0]],
...         names=['l1', 'l2'])
... )

>>> df
       a
l1 l2   
0  0   0
1  1   1
0  2   2
1  0   3
0  1   4
1  2   5
0  0   6
1  1   7
0  2   8
1  0   9

# explicitly groupby on level names or indices
>>> df.groupby(['l1', 'l2']).sum()  # or df.groupby(level=list(range(df.index.nlevels))).sum()
        a
l1 l2    
0  0    6
   1    4
   2   10
1  0   12
   1    8
   2    5

# groupby on the multi index itself
# instead of a MultiIndex DataFrame,
# returns a single-level-indexed DataFrame with tuples in the index
>>> df.groupby(df.index).sum()
         a
(0, 0)   6
(0, 1)   4
(0, 2)  10
(1, 0)  12
(1, 1)   8
(1, 2)   5

Problem description

When a DataFrame whose index is a MultiIndex is grouped on that index (df.groupby(df.index)), the aggregated result is a DataFrame with a single-level index whose entries are the tuples from the original MultiIndex, so the level structure and level names are lost. This is inferior to the behavior you obtain when passing the level names to df.groupby, which returns a DataFrame with the same MultiIndex levels and names.

Expected Output

When df has a MultiIndex, df.groupby(df.index) should be identical to df.groupby(level=list(range(df.index.nlevels))) (or to df.groupby(df.index.names) when all of df's index levels are named).
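Until that happens, the tuple index can be converted back by hand. The following is a minimal sketch of one possible workaround, not an official recommendation: it rebuilds the MultiIndex from the tuples with pd.MultiIndex.from_tuples and reuses the original level names. The isinstance guard is an assumption added so the snippet stays harmless if a pandas version already returns a MultiIndex.

```python
import pandas as pd

# Same frame as in the code sample above.
df = pd.DataFrame(
    {"a": list(range(10))},
    index=pd.MultiIndex.from_arrays(
        [[0, 1, 0, 1, 0, 1, 0, 1, 0, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]],
        names=["l1", "l2"],
    ),
)

# Reference result: group explicitly on every index level.
expected = df.groupby(level=list(range(df.index.nlevels))).sum()

# Group on the index object itself, then restore the levels if they were lost.
result = df.groupby(df.index).sum()
if not isinstance(result.index, pd.MultiIndex):
    # Rebuild the MultiIndex from the tuple labels and reattach the names.
    result.index = pd.MultiIndex.from_tuples(result.index, names=df.index.names)
```

After the conversion, result carries the l1 and l2 levels again and can be compared directly against expected.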

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

WillAyd commented 5 years ago

Hmm OK makes sense. Investigation into the issue and PRs are always welcome!

rhshadrach commented 1 year ago

Depending on how you read the API docs, passing an Index is either not explicitly supported or this is the "expected" behavior. The closest documented input type for by that an Index matches is "list", though maybe the docs should say "list-like".

This occurs because iterating over a MultiIndex yields one tuple per row, so groupby sees a list-like of tuples:

print(list(df.index))
[(0, 0), (1, 1), (0, 2), (1, 0), (0, 1), (1, 2), (0, 0), (1, 1), (0, 2), (1, 0)]

So I think this is either a docs issue or an enhancement request. I'm open to carving out special handling of passing a MultiIndex into groupby.
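As a rough illustration of what such special handling could look like from the user's side, here is a sketch of a wrapper that dispatches to level-based grouping when the index is a MultiIndex. The name groupby_own_index is hypothetical and this is not how pandas implements or would implement it internally; it only uses the existing level= argument of DataFrame.groupby.

```python
import pandas as pd


def groupby_own_index(frame, **kwargs):
    """Hypothetical helper: group a frame on its own index, keeping all levels."""
    if isinstance(frame.index, pd.MultiIndex):
        # Group on every level so the result keeps the MultiIndex and its names.
        levels = list(range(frame.index.nlevels))
        return frame.groupby(level=levels, **kwargs)
    # A flat index has a single level, addressed as level 0.
    return frame.groupby(level=0, **kwargs)


df = pd.DataFrame(
    {"a": list(range(10))},
    index=pd.MultiIndex.from_arrays(
        [[0, 1, 0, 1, 0, 1, 0, 1, 0, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]],
        names=["l1", "l2"],
    ),
)

summed = groupby_own_index(df).sum()
```

Here summed has the l1 and l2 levels, matching the output of df.groupby(['l1', 'l2']).sum() in the original report.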

cc @jbrockmendel for any thoughts.