pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `<Framelike>.__contains__(<unhashable>)` errors #58909

Open NickCrews opened 2 months ago

NickCrews commented 2 months ago

Pandas version checks

Reproducible Example

import pandas as pd
vals = [{1:2}, {"a":"b"}]
{1:2} in vals # works, as expected
{1:2} in pd.Series(vals)  # TypeError

Issue Description

Related: https://github.com/pandas-dev/pandas/issues/36285

Series and DataFrames should support __contains__ for unhashable needles. It makes sense to disallow unhashable types as keys in set-like and map-like collections, because the "identity" of the object can change between insertion time and query time. However, framelikes are more like Python lists, which don't have hash-map-like behavior.

Am I missing something here that would cause poorly defined behavior?

Expected Behavior

The existing fast hash-based implementation should keep working for hashable types, but we should have an O(n) fallback implementation for unhashable types.
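A minimal sketch of that fallback in plain Python, not pandas internals. `ValueContainer` is a hypothetical name used only for illustration; it is not a pandas API and makes no claim about how pandas implements `__contains__`.

```python
from typing import Any, Iterable


class ValueContainer:
    """Hypothetical list-like container: hash lookup first, O(n) scan as fallback."""

    def __init__(self, values: Iterable[Any]) -> None:
        self._values = list(values)
        self._hashed = set()      # hashable items, for O(1) average lookups
        self._unhashable = []     # items that can only be found by scanning
        for value in self._values:
            try:
                self._hashed.add(value)
            except TypeError:
                self._unhashable.append(value)

    def __contains__(self, needle: Any) -> bool:
        try:
            if needle in self._hashed:
                return True
        except TypeError:
            # Unhashable needle (dict, list, ...): compare against every item.
            return any(value == needle for value in self._values)
        # Hashable needle not in the set: it may still equal an unhashable
        # item (for example frozenset({1}) == {1}), so scan those.
        return any(value == needle for value in self._unhashable)


vals = ValueContainer([{1: 2}, {"a": "b"}, 3])
found_dict = {1: 2} in vals      # linear-scan path
found_int = 3 in vals            # hash path
missing = {"x": 1} in vals       # linear-scan path, no match
```

The scan relies on `==` returning a plain boolean, which holds for dicts and lists but would need extra care for values such as NumPy arrays.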

Installed Versions

INSTALLED VERSIONS
------------------
commit : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-1064-azure
Version : #73~20.04.1-Ubuntu SMP Mon May 6 09:43:44 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 45.2.0
pip : 20.0.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : 3.5.2
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.0
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

asishm commented 2 months ago

{1:2} in pd.Series(vals) doesn't do what you think it does. k in pd.Series checks the index, not the values, and I believe pd.Series indexes have to be hashable (in this case the generated index is just a RangeIndex(start=0, stop=2)). Series behave more like dicts than lists, with the index serving as the keys.

What you are looking for is pd.Series(vals).isin([{1:2}])
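A short sketch of the distinction between label membership and value membership, assuming a reasonably recent pandas. The hashable example uses isin; for the dict case the sketch uses an explicit equality scan and does not rely on isin accepting unhashable values.

```python
import pandas as pd

ser = pd.Series(["x", "y"])      # default index is RangeIndex(start=0, stop=2)

label_hit = 0 in ser             # 0 is an index label
value_as_label = "x" in ser      # "x" is a value, not a label
value_hit = bool(ser.isin(["x"]).any())   # membership among the values

# An unhashable needle fails the index lookup, as reported in this issue.
dict_ser = pd.Series([{1: 2}, {"a": "b"}])
try:
    {1: 2} in dict_ser
    raised = False
except TypeError:
    raised = True

# Plain O(n) equality scan over the values; no hashing involved.
scan_hit = any(value == {1: 2} for value in dict_ser)
```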

NickCrews commented 2 months ago

Thanks @asishm you are right, that is specifically what I was looking for.

Let me rephrase, then: I think {1:2} in pd.Index(vals) should not error.

asishm commented 2 months ago

I'll let the dev team comment. My mental model of a pandas Series is that of a dict, so pd.Series({'a': 1, 'b': 2}) is similar to d = {'a': 1, 'b': 2}. {1:2} in d.keys() errors out for a regular Python dict as well, with TypeError: unhashable type: 'dict'. So IMO, erroring for a pd.Index is consistent with that.
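The dict comparison can be checked in plain Python. This sketch only illustrates built-in dict and list behavior; the variable names are arbitrary.

```python
d = {"a": 1, "b": 2}

try:
    {1: 2} in d.keys()
    keys_error = None
except TypeError as exc:
    keys_error = str(exc)        # hashing the dict needle fails

# A list compares by equality only, so the same needle is fine.
in_list = {1: 2} in [{1: 2}, {"a": "b"}]
```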

However, the fact that you can create an Index with unhashable objects (as you've shown) seems to be a bug. The docs clearly state:

> Notes
>
> An Index instance can only contain hashable objects. An Index instance can not hold numpy float16 dtype.

NickCrews commented 2 months ago

> My mental model of a pandas Series is that of a dict

I'm guessing you mean a pandas Index? My mental model of an Index is closer to "a list, just with fast lookup", since an Index can hold duplicate values, e.g. pd.Index([1, 1, 2]). I hadn't ever even thought about hashability before. My mental model of a Series is definitely plain list-like; I don't expect it to have any quick-lookup functionality like a dict.

You are right, it appears that the implementation and the docs are out of sync. We need to figure out what the desired semantics are before anything else. pd.Index([1.2, 3.4]) works just fine, so that is a point against the docs.

asishm commented 2 months ago

Sorry, should've been a bit clearer. Series -> dict, Index -> dict.keys() is what I meant to say. The only difference being uniqueness. Highlighting some of th

| Dict | Series |
|---|---|
| d.keys() | ser.index |
| d.items() | ser.items() |
| d.values() | ser.array (or .values/.to_numpy()) etc. |
| a in d equiv to a in d.keys() | a in ser equiv to a in ser.index |
| d.keys() - hashable and unique | ser.index - hashable, but lacks uniqueness |
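The correspondence in the table can be spot-checked with a small sketch, assuming a recent pandas. The variable names are illustrative only.

```python
import pandas as pd

d = {"a": 1, "b": 2}
ser = pd.Series(d)

same_keys = list(ser.index) == list(d.keys())
same_items = list(ser.items()) == list(d.items())
same_values = ser.to_numpy().tolist() == list(d.values())
key_membership = ("a" in ser) == ("a" in d)

# Unlike dict keys, index labels may repeat.
dup = pd.Series([1, 2], index=["a", "a"])
labels_unique = dup.index.is_unique
```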

> pd.Index([1.2, 3.4]) works just fine, so that is a point against the docs.

This is fine. pd.Index([1.2, 3.4], dtype='float16') fails with a NotImplementedError: float16 indexes are not supported, which is in line with the docs.

NickCrews commented 2 months ago

> This is fine. pd.Index([1.2, 3.4], dtype='float16') fails with a NotImplementedError: float16 indexes are not supported which is in-line with the docs.

Oh shoot, you are right. I totally misread the docs. I interpreted "An Index instance can not hold numpy float16 dtype." as a follow-up to the previous sentence, e.g. "for example, indexes can't hold float dtypes, since they are unhashable". (I also mistakenly thought that floating dtypes were unhashable, but it looks like they are hashable.)
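For reference, a quick check of hashability in plain Python. This covers built-in floats and dicts only and says nothing about pandas internals.

```python
from collections.abc import Hashable

float_hashable = isinstance(1.2, Hashable)     # floats define __hash__
dict_hashable = isinstance({1: 2}, Hashable)   # dict sets __hash__ to None
float_in_set = 1.2 in {1.2, 3.4}               # so floats work as set members
```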