pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.69k stars 17.92k forks

BUG: `.unstack()` malfunctions for triple Indices??? #55840

Open kwhkim opened 12 months ago

kwhkim commented 12 months ago

Reproducible Example

>>> import statsmodels.api as sm 
>>> import pandas as pd
>>> mtcars = sm.datasets.get_rdataset('mtcars').data
>>> dat2 = mtcars[['cyl', 'gear', 'carb']]\
... .value_counts()
>>> dat2.unstack([0,1,2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kwhkim/miniforge3/envs/test_pandas/lib/python3.9/site-packages/pandas/core/series.py", line 4455, in unstack
    return unstack(self, level, fill_value, sort)
  File "/Users/kwhkim/miniforge3/envs/test_pandas/lib/python3.9/site-packages/pandas/core/reshape/reshape.py", line 494, in unstack
    return _unstack_multiple(obj, level, fill_value=fill_value, sort=sort)
  File "/Users/kwhkim/miniforge3/envs/test_pandas/lib/python3.9/site-packages/pandas/core/reshape/reshape.py", line 445, in _unstack_multiple
    unstacked = dummy.unstack("__placeholder__", fill_value=fill_value, sort=sort)
  File "/Users/kwhkim/miniforge3/envs/test_pandas/lib/python3.9/site-packages/pandas/core/series.py", line 4455, in unstack
    return unstack(self, level, fill_value, sort)
  File "/Users/kwhkim/miniforge3/envs/test_pandas/lib/python3.9/site-packages/pandas/core/reshape/reshape.py", line 511, in unstack
    raise ValueError(
ValueError: index must be a MultiIndex to unstack, <class 'pandas.core.indexes.base.Index'> was passed
>>> dat2.unstack([0,1])
cyl     8    4    6    4    6    4    6    8
gear    3    4    4    5    3    3    5    5
carb                                        
1     NaN  4.0  NaN  NaN  2.0  1.0  NaN  NaN
2     4.0  4.0  NaN  2.0  NaN  NaN  NaN  NaN
3     3.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN
4     5.0  NaN  4.0  NaN  NaN  NaN  NaN  1.0
6     NaN  NaN  NaN  NaN  NaN  NaN  1.0  NaN
8     NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0
>>> type(dat2.unstack().unstack().unstack())
<class 'pandas.core.series.Series'>

Issue Description

.unstack() does not work for triple(?) MultiIndex.

.unstack().unstack().unstack() works, but the result is neither expected nor correct (the result is a pd.Series)

Expected Behavior

.unstack([0,1,2]) on a three-level MultiIndex should produce a one-row DataFrame with a three-level MultiIndex for columns
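One way to get the shape described above without calling .unstack() on every level is to transpose a one-column frame. This is only a sketch of a possible workaround, not a fix for the bug; the data below is a small hypothetical stand-in for the mtcars counts, and the names `counts` and `one_row` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in data (not the actual mtcars value counts).
idx = pd.MultiIndex.from_tuples(
    [(8, 3, 2), (4, 4, 1), (6, 4, 4)],
    names=["cyl", "gear", "carb"],
)
counts = pd.Series([4, 4, 4], index=idx, name="count")

# Turn the Series into a one-column DataFrame, then transpose it so that
# all three index levels become column levels of a single-row frame.
one_row = counts.to_frame().T

print(one_row)
```

The single row is labelled with the Series name ("count" here), and the three index levels carry over as column levels.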

Installed Versions

/Users/kwhkim/miniforge3/envs/test_pandas/lib/python3.9/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit : a60ad39b4a9febdea9a59d602dad44b1538b0ea5
python : 3.9.18.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Mon Aug 29 04:31:12 PDT 2022; root:xnu-7195.141.39~2/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : ko_KR.UTF-8
LOCALE : ko_KR.UTF-8

pandas : 2.1.2
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

rhshadrach commented 12 months ago

@kwhkim - thanks for the report. Can you elaborate on this:

.unstack().unstack().unstack() works, but the result is neither expected nor correct (the result is a pd.Series)

What result do you get and what result do you expect?

kwhkim commented 12 months ago

@rhshadrach Having reread the documentation for .unstack(), I see nothing wrong with the result of .unstack().unstack().unstack(); the confusion was mine.

I thought .unstack() should behave like .T when the index is not a MultiIndex. And then again, the design looks somewhat different than what I would design, because,

dat = pd.DataFrame({'x':[1,3,2,4,5]})
dat.unstack() # works
dat.unstack().stack() # AttributeError!

dat.unstack().stack() does not work for a one-column DataFrame. I don't know whether there is a deeper rationale for how .unstack() and .stack() are supposed to behave.

rhshadrach commented 11 months ago

.unstack() does not work for triple(?) MultiIndex.

Agreed - this looks like a bug to me. Further investigations and PRs to fix are welcome. This is a simpler reproducer.

df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 1], 'c': [3, 4, 5]}).set_index(['a', 'b'])
df['c'].unstack([0, 1])
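For anyone picking this up, the reproducer can be wrapped so that it runs to completion whether or not the installed pandas still has the bug. This is a rough sketch: the variable names `partial` and `outcome` are made up for illustration, and the ValueError branch reflects what the traceback above reports for pandas 2.1.2, not a guarantee about other versions.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 1, 2], "b": [1, 2, 1], "c": [3, 4, 5]}
).set_index(["a", "b"])

# Unstacking only one of the two levels is expected to work:
# level "a" moves to the columns and level "b" stays as the index.
partial = df["c"].unstack(0)

# Unstacking every level is the case reported here. On affected versions
# it raises ValueError; record the outcome instead of crashing.
try:
    result = df["c"].unstack([0, 1])
    outcome = f"returned {type(result).__name__}"
except ValueError as exc:
    outcome = f"ValueError: {exc}"

print(partial)
print(outcome)
```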

And then again, the design looks somewhat different than what I would design, because,

I do not know what you are trying to say here. How would you design it?

dat.unstack().stack() # AttributeError!

You are trying to call .stack() on a Series. This method is not implemented for a Series.

kwhkim commented 11 months ago

Yes, I am just wondering whether it is appropriate, design-wise, for dat.unstack() to return a Series, since that breaks the expectation that dat.unstack().stack() returns the original dat.
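To make the round-trip question concrete, here is a minimal sketch using the same one-column frame. It illustrates that the Series returned by dat.unstack() has a two-level index (column label, original row label), so a second .unstack(0) appears to recover the original layout even though .stack() is unavailable on a Series. The names `stacked` and `restored` are made up for illustration.

```python
import pandas as pd

dat = pd.DataFrame({"x": [1, 3, 2, 4, 5]})

# With a flat (non-MultiIndex) index, DataFrame.unstack() returns a Series
# indexed by (column label, original row label).
stacked = dat.unstack()

# Series does not define .stack(), which is why dat.unstack().stack() fails.
series_has_stack = hasattr(pd.Series, "stack")

# Moving the outer level (the column label) back to the columns
# gives a one-column DataFrame again.
restored = stacked.unstack(0)

print(type(stacked).__name__, stacked.index.nlevels)
print(series_has_stack)
print(restored)
```

So the inverse of dat.unstack() in this case is another .unstack(0) rather than .stack(), which is arguably the asymmetry being questioned here.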

kwhkim commented 11 months ago

take