pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.87k stars 18.02k forks source link

BUG: when using id_vars in .melt(), the id_vars does not recognize the column name in a multiIndex dataframe #60018

Open ivan-marroquin opened 1 month ago

ivan-marroquin commented 1 month ago

Pandas version checks

Reproducible Example

import pandas as pd
import yfinance as yf
from datetime import date, timedelta

end_date= date.today().strftime("%Y-%m-%d")
start_date= (date.today() - timedelta(days= 365)).strftime("%Y-%m-%d")

tickers= ['RELIANCE.NS', 'TCS.NS', 'INFY.NS', 'HDFCBANK.NS']

data= yf.download(tickers, start= start_date, end= end_date)

data= data.reset_index()

data_melted= data.melt(id_vars= ['Date'], var_name= ['Attribute', 'Ticker'])

Issue Description

KeyError Traceback (most recent call last) Cell In [19], line 15 11 data= data.reset_index() 13 # melt the DataFrame to make it long format where each row is a 14 # unique combination of Date, Ticker, and attributes ---> 15 data_melted= data.melt(id_vars= ['Date'], var_name= ['Attribute', 'Ticker']) 17 # pivot the melted DataFrame to have the attributes (Open, High, Low, etc.) as columns 18 data_pivoted= data_melted.pivot_table(index= ['Date', 'Ticker'], 19 columns= 'Attribute', values= 'value', 20 aggfunc= 'first')

File ~/python_3.9.0/lib/python3.9/site-packages/pandas/core/frame.py:9942, in DataFrame.melt(self, id_vars, value_vars, var_name, value_name, col_level, ignore_index) 9932 @Appender(_shared_docs["melt"] % {"caller": "df.melt(", "other": "melt"}) 9933 def melt( 9934 self, (...) 9940 ignore_index: bool = True, 9941 ) -> DataFrame: -> 9942 return melt( 9943 self, 9944 id_vars=id_vars, 9945 value_vars=value_vars, 9946 var_name=var_name, 9947 value_name=value_name, ... 77 ) 78 if value_vars_was_not_none: 79 frame = frame.iloc[:, algos.unique(idx)]

KeyError: "The following id_vars or value_vars are not present in the DataFrame: ['Date']"

Expected Behavior

The melt function should generate a DataFrame to make it long format where each row is a unique combination of Date, Ticker, and attributes

Installed Versions

INSTALLED VERSIONS ------------------ commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140 python : 3.9.0.final.0 python-bits : 64 OS : Linux OS-release : 6.5.0-25-generic Version : #25~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Feb 20 16:09:15 UTC 2 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 2.2.2 numpy : 1.23.3 pytz : 2024.2 dateutil : 2.8.2 setuptools : 59.8.0 pip : 22.3.1 Cython : 0.29.32 pytest : None hypothesis : None sphinx : 5.1.1 blosc : None feather : None xlsxwriter : None lxml.etree : 4.9.1 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.5.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : None bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : 2024.6.1 gcsfs : None matplotlib : None numba : None numexpr : 2.10.1 odfpy : None openpyxl : 3.1.2 pandas_gbq : None pyarrow : 16.1.0 pyreadstat : None python-calamine : None pyxlsb : None s3fs : 2024.6.1 scipy : 1.13.1 sqlalchemy : 2.0.31 tables : None tabulate : 0.9.0 xarray : 2024.6.0 xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None Python 3.9.0
vamsi-verma-s commented 1 month ago

breaking change : #55948

vamsi-verma-s commented 1 month ago

@ivan-marroquin

It does not look like this is supported. Two things -

  1. The id_vars is no longer checked against a flattened index. (changes in https://github.com/pandas-dev/pandas/pull/55948). For multiIndex, this must be a list of tuples.
  2. var_name name must be a scalar as mentioned in the docs. (enforeced in https://github.com/pandas-dev/pandas/pull/55948) It looks like for a multiIndex, the var_names are determined from the index names and cannot be specified.

Therefore, following might be the correct way to go about this -

data.columns.names = ['Attribute', 'Ticker']

df.melt(id_vars= [('Date', '')]).rename(columns={('Date', ''): 'Date'})

Output -

                          Date  Attribute       Ticker         value
0    2023-10-12 00:00:00+00:00  Adj Close  HDFCBANK.NS  1.528971e+03
1    2023-10-13 00:00:00+00:00  Adj Close  HDFCBANK.NS  1.515061e+03
2    2023-10-16 00:00:00+00:00  Adj Close  HDFCBANK.NS  1.508994e+03
3    2023-10-17 00:00:00+00:00  Adj Close  HDFCBANK.NS  1.520438e+03
4    2023-10-18 00:00:00+00:00  Adj Close  HDFCBANK.NS  1.499277e+03
...                        ...        ...          ...           ...
5851 2024-10-04 00:00:00+00:00     Volume       TCS.NS  2.965463e+06
5852 2024-10-07 00:00:00+00:00     Volume       TCS.NS  1.472619e+06
5853 2024-10-08 00:00:00+00:00     Volume       TCS.NS  1.541867e+06
5854 2024-10-09 00:00:00+00:00     Volume       TCS.NS  1.082504e+06
5855 2024-10-10 00:00:00+00:00     Volume       TCS.NS  2.378875e+06

[5856 rows x 4 columns]
rhshadrach commented 1 month ago

@ivan-marroquin - can you post a reproducible example that does not depend on yfinance.

thunfischtoast commented 4 weeks ago

@rhshadrach I could reproduce it like this. Worked on pandas 2.1.4, but not on 2.2.0

import pandas as pd
df = pd.DataFrame({"x": ["a","a","b","b"], "y": ["c","d","c","d"], "z": [1,2,3,4]}).set_index("x")
df.columns = pd.MultiIndex.from_tuples([("y", "i"), ("y", "j")])
df.reset_index().melt(id_vars="x")

Output in 2.1.4:


| x | variable_0 | variable_1 | value
-- | -- | -- | -- | --
a | y | i | c
a | y | i | d
b | y | i | c
b | y | i | d
a | y | j | 1
a | y | j | 2
b | y | j | 3
b | y | j | 4

Output in 2.2.0:

KeyError                                  Traceback (most recent call last)
Cell In[5], line 1
----> 1 df.reset_index().melt(id_vars="x")

File <project path>\env\lib\site-packages\pandas\core\frame.py:9915, in DataFrame.melt(self, id_vars, value_vars, var_name, value_name, col_level, ignore_index)
   9905 @Appender(_shared_docs["melt"] % {"caller": "df.melt(", "other": "melt"})
   9906 def melt(
   9907     self,
   (...)
   9913     ignore_index: bool = True,
   9914 ) -> DataFrame:
-> 9915     return melt(
   9916         self,
   9917         id_vars=id_vars,
   9918         value_vars=value_vars,
   9919         var_name=var_name,
   9920         value_name=value_name,
   9921         col_level=col_level,
   9922         ignore_index=ignore_index,
   9923     ).__finalize__(self, method="melt")

File <project path>\env\lib\site-packages\pandas\core\reshape\melt.py:74, in melt(frame, id_vars, value_vars, var_name, value_name, col_level, ignore_index)
    70 if missing.any():
     71     missing_labels = [
     72         lab for lab, not_found in zip(labels, missing) if not_found
     73     ]
---> 74     raise KeyError(
     75         "The following id_vars or value_vars are not present in "
     76         f"the DataFrame: {missing_labels}"
     77     )
     78 if value_vars_was_not_none:
     79     frame = frame.iloc[:, algos.unique(idx)]

KeyError: "The following id_vars or value_vars are not present in the DataFrame: ['x']"
rhshadrach commented 3 weeks ago

In various places, we do allow "x" to match ("x", "") for a MultiIndex. However, this is not the case for MultiIndex.get_indexer.

df = pd.DataFrame({"x": ["a","a","b","b"], "y": ["c","d","c","d"], "z": [1,2,3,4]}).set_index("x")
df.columns = pd.MultiIndex.from_tuples([("y", "i"), ("y", "j")])
df = df.reset_index()
print(df.columns.get_indexer(pd.Index(["x"])))
# [-1]

If we returned [0] here instead, this regression would also be fixed.

cc @mroeschke