pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

isin() returns different results than eq() when mixing dtypes of comparators #16938

Open mansenfranzen opened 7 years ago

mansenfranzen commented 7 years ago

Expected (correct) behavior when the comparator dtypes match

import numpy as np
import pandas as pd

s = pd.Series([1.2, 2.3])
s.eq(1.2) == s.isin([1.2]) # True, True

s32 = pd.Series([1.2, 2.3], dtype="float32")
s32.eq(np.float32(1.2)) == s32.isin([np.float32(1.2)]) # True, True

Unexpected behavior for mixed comparator dtypes

s32.eq(1.2) == s32.isin([1.2]) # False, True

# in detail
s32.eq(1.2) # True, False
s32.isin([1.2]) # False, False

In summary, eq() and isin() return different results when mixing comparator dtypes.

Problem description

Both methods eq() and isin() should return the same result. Here is the related SO article.

This issue might originate in numpy and may not be directly pandas related (see here for more). Scalar comparison (equivalent to eq()) and array comparison (equivalent to isin()) yield different results for mixed comparator dtypes in numpy, too.
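
A minimal sketch of that numpy behavior, using only a Python float and a float64 array as comparands. The variable names are illustrative and not taken from the linked discussion.

```python
import numpy as np

a32 = np.array([1.2, 2.3], dtype="float32")

# Array vs. Python scalar: the comparison is carried out in float32,
# so 1.2 is rounded the same way as the stored value.
scalar_cmp = a32 == 1.2                  # expected: [True, False]

# Array vs. float64 array: both sides are promoted to float64, and the
# float32 rounding of 1.2 no longer matches the float64 value 1.2.
array_cmp = a32 == np.array([1.2, 2.3])  # expected: [False, False]

print(scalar_cmp, array_cmp)
```

The first comparison mirrors what eq(1.2) reports above, the second mirrors what isin([1.2]) reports.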

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

jreback commented 7 years ago

No, this has to do with the upcasting rules. numpy is not used here except in a small evaluation case. Upcasting of the mixed operands is actually somewhat non-trivial; see the code: https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L372
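
To illustrate why the choice of common dtype matters, here is a rough sketch in plain numpy. It is not the pandas code linked above; `membership` is a hypothetical helper, and the assumption is simply that both operands are cast to one dtype before an elementwise membership test.

```python
import numpy as np

def membership(values, candidates, dtype):
    """Hypothetical helper: cast both sides to `dtype`, then test membership."""
    v = np.asarray(values).astype(dtype)
    c = np.asarray(candidates).astype(dtype)
    # Compare every value against every candidate via broadcasting.
    return (v[:, None] == c[None, :]).any(axis=1)

values = np.array([1.2, 2.3], dtype="float32")
candidates = np.array([1.2])  # float64, as produced from a plain Python list

# The common dtype of float32 and float64 under numpy promotion is float64.
common = np.result_type(values.dtype, candidates.dtype)

upcast = membership(values, candidates, common)          # expected: [False, False]
downcast = membership(values, candidates, values.dtype)  # expected: [True, False]

print(common, upcast, downcast)
```

Upcasting the float32 data keeps its rounding error, so it no longer equals the float64 candidate. Casting the candidate down to float32 instead reproduces the eq() result, at the cost of treating nearby float64 values as equal.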

You are welcome to add your test case and debug. There are quite a few tests around this, so it might be tricky to get right.