pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.53k stars 17.88k forks source link

BUG: when using id_vars in .melt(), the id_vars does not recognize the column name in a multiIndex dataframe #60018

Open ivan-marroquin opened 5 days ago

ivan-marroquin commented 5 days ago

Pandas version checks

Reproducible Example

import pandas as pd
import yfinance as yf
from datetime import date, timedelta

end_date= date.today().strftime("%Y-%m-%d")
start_date= (date.today() - timedelta(days= 365)).strftime("%Y-%m-%d")

tickers= ['RELIANCE.NS', 'TCS.NS', 'INFY.NS', 'HDFCBANK.NS']

data= yf.download(tickers, start= start_date, end= end_date)

data= data.reset_index()

data_melted= data.melt(id_vars= ['Date'], var_name= ['Attribute', 'Ticker'])

Issue Description

KeyError Traceback (most recent call last) Cell In [19], line 15 11 data= data.reset_index() 13 # melt the DataFrame to make it long format where each row is a 14 # unique combination of Date, Ticker, and attributes ---> 15 data_melted= data.melt(id_vars= ['Date'], var_name= ['Attribute', 'Ticker']) 17 # pivot the melted DataFrame to have the attributes (Open, High, Low, etc.) as columns 18 data_pivoted= data_melted.pivot_table(index= ['Date', 'Ticker'], 19 columns= 'Attribute', values= 'value', 20 aggfunc= 'first')

File ~/python_3.9.0/lib/python3.9/site-packages/pandas/core/frame.py:9942, in DataFrame.melt(self, id_vars, value_vars, var_name, value_name, col_level, ignore_index) 9932 @Appender(_shared_docs["melt"] % {"caller": "df.melt(", "other": "melt"}) 9933 def melt( 9934 self, (...) 9940 ignore_index: bool = True, 9941 ) -> DataFrame: -> 9942 return melt( 9943 self, 9944 id_vars=id_vars, 9945 value_vars=value_vars, 9946 var_name=var_name, 9947 value_name=value_name, ... 77 ) 78 if value_vars_was_not_none: 79 frame = frame.iloc[:, algos.unique(idx)]

KeyError: "The following id_vars or value_vars are not present in the DataFrame: ['Date']"

Expected Behavior

The melt function should generate a DataFrame to make it long format where each row is a unique combination of Date, Ticker, and attributes

Installed Versions

INSTALLED VERSIONS ------------------ commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140 python : 3.9.0.final.0 python-bits : 64 OS : Linux OS-release : 6.5.0-25-generic Version : #25~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Feb 20 16:09:15 UTC 2 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 2.2.2 numpy : 1.23.3 pytz : 2024.2 dateutil : 2.8.2 setuptools : 59.8.0 pip : 22.3.1 Cython : 0.29.32 pytest : None hypothesis : None sphinx : 5.1.1 blosc : None feather : None xlsxwriter : None lxml.etree : 4.9.1 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.5.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : None bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : 2024.6.1 gcsfs : None matplotlib : None numba : None numexpr : 2.10.1 odfpy : None openpyxl : 3.1.2 pandas_gbq : None pyarrow : 16.1.0 pyreadstat : None python-calamine : None pyxlsb : None s3fs : 2024.6.1 scipy : 1.13.1 sqlalchemy : 2.0.31 tables : None tabulate : 0.9.0 xarray : 2024.6.0 xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None Python 3.9.0
vamsi-verma-s commented 2 days ago

breaking change : #55948

vamsi-verma-s commented 2 days ago

@ivan-marroquin

It does not look like this is supported. Two things -

  1. The id_vars is no longer checked against a flattened index. (changes in https://github.com/pandas-dev/pandas/pull/55948). For multiIndex, this must be a list of tuples.
  2. var_name name must be a scalar as mentioned in the docs. (enforeced in https://github.com/pandas-dev/pandas/pull/55948) It looks like for a multiIndex, the var_names are determined from the index names and cannot be specified.

Therefore, following might be the correct way to go about this -

data.columns.names = ['Attribute', 'Ticker']

df.melt(id_vars= [('Date', '')]).rename(columns={('Date', ''): 'Date'})

Output -

                          Date  Attribute       Ticker         value
0    2023-10-12 00:00:00+00:00  Adj Close  HDFCBANK.NS  1.528971e+03
1    2023-10-13 00:00:00+00:00  Adj Close  HDFCBANK.NS  1.515061e+03
2    2023-10-16 00:00:00+00:00  Adj Close  HDFCBANK.NS  1.508994e+03
3    2023-10-17 00:00:00+00:00  Adj Close  HDFCBANK.NS  1.520438e+03
4    2023-10-18 00:00:00+00:00  Adj Close  HDFCBANK.NS  1.499277e+03
...                        ...        ...          ...           ...
5851 2024-10-04 00:00:00+00:00     Volume       TCS.NS  2.965463e+06
5852 2024-10-07 00:00:00+00:00     Volume       TCS.NS  1.472619e+06
5853 2024-10-08 00:00:00+00:00     Volume       TCS.NS  1.541867e+06
5854 2024-10-09 00:00:00+00:00     Volume       TCS.NS  1.082504e+06
5855 2024-10-10 00:00:00+00:00     Volume       TCS.NS  2.378875e+06

[5856 rows x 4 columns]