pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: .loc operation cannot locate existing index when having single string as index for dataframe ('string',) #57750

Open carlonlv opened 8 months ago

carlonlv commented 8 months ago

Reproducible Example

import pandas as pd
temp = pd.DataFrame({'a': [1, 2, 3], 'b': False}, index=[('h',), ('v',), ('c',)])
print(('h',) in temp.index)  # prints True
temp.loc[('h',), 'b'] = True  # raises KeyError

Issue Description

It seems that when the index contains single-element tuples such as ('a',), pandas automatically converts a tuple key passed to .loc into the string 'a'. The assignment in the example above raises:

KeyError: "None of [Index(['h'], dtype='object')] are in the [index]"

Expected Behavior

The temp.loc[('h',)] operation should succeed.

Installed Versions

INSTALLED VERSIONS
------------------
commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.11.5.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-1015-azure
Version : #15~22.04.1-Ubuntu SMP Tue Feb 13 01:15:12 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.1
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.20.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.12.2
gcsfs : None
matplotlib : 3.8.3
numba : None
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.19.0
tzdata : 2023.3
qtpy : None
pyqt5 : None

carlonlv commented 8 months ago

Note that the same operation with tuples with length >= 2 works as expected.

Also, in the above example, temp.loc[temp.index == ('h',), 'b'] = True works fine.

mbarki-mohamed commented 8 months ago

Please have a look at the following answer on Stack Overflow (it's an old one but still relevant): https://stackoverflow.com/questions/40186361/pandas-dataframe-with-tuple-of-strings-as-index

It will work if you pass the tuple inside a list as the argument to .loc (explanation in the link). That is, change

temp.loc[('h',), 'b'] = True  # error

to

temp.loc[[('h',)], 'b'] = True  # works fine
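For reference, here is a minimal sketch that puts the two workarounds mentioned in this thread side by side: the list-wrapped key and the boolean mask. It reuses the frame from the report and assumes behavior as described for pandas 2.2.1; other versions may differ.

```python
import pandas as pd

# Same frame as in the report: an ordinary (non-Multi) Index whose labels are 1-tuples.
temp = pd.DataFrame({'a': [1, 2, 3], 'b': False},
                    index=[('h',), ('v',), ('c',)])

# Workaround 1: wrap the tuple in a list so .loc receives a list
# containing one label instead of a bare tuple.
temp.loc[[('h',)], 'b'] = True

# Workaround 2: build a boolean mask by comparing the index to the tuple.
temp.loc[temp.index == ('v',), 'b'] = True

print(temp)
```

The list form keeps the lookup label-based, while the mask form compares every index entry against the tuple, so the former is probably the closer substitute for a plain label lookup.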

Hope this helps!

carlonlv commented 8 months ago

Thanks for the clarification. I was simply using the DataFrame as a dictionary. I guess this is one of those "gotcha" moments.

rhshadrach commented 8 months ago

Thanks for the report. pandas uses tuples to signify values in a MultiIndex, and this is the reason why your lookup fails. One idea is to treat non-MultiIndexes differently here, allowing for lookups with tuples, whereas supplying tuples when the index is a MultiIndex would interpret them as levels. Perhaps this has some bad implications though; further investigations are welcome!
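To illustrate the MultiIndex convention mentioned here, the following is a small sketch, not taken from the pandas source. The frame `df` and the level names `key` and `n` are made up for the example. With a real MultiIndex, a tuple passed to .loc is read as one value per level:

```python
import pandas as pd

# Hypothetical two-level index; the level names are illustrative only.
mi = pd.MultiIndex.from_tuples([('h', 1), ('v', 2), ('c', 3)],
                               names=['key', 'n'])
df = pd.DataFrame({'b': [False, False, False]}, index=mi)

# The tuple ('h', 1) supplies one value for each index level,
# so this addresses exactly one row.
df.loc[('h', 1), 'b'] = True

print(df)
```

This is the case the tuple handling in .loc is built around, which is presumably why a bare tuple key on a flat index of tuples is not taken as a single label.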