pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: isin check with numpy array works incorrectly when using UInt64 dtype #59609

Open qltwis opened 3 weeks ago

qltwis commented 3 weeks ago

Reproducible Example

import pandas as pd
import numpy as np

pd.Series([635554097106142143],dtype="UInt64").isin(np.array([635554097106142079]))

Issue Description

The isin check returns True, even though 635554097106142143 ≠ 635554097106142079.

Presumably, the values are converted during the check to a dtype with less precision.
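To illustrate why a lossy conversion would produce this result, here is a minimal sketch using only the standard library. It assumes the intermediate dtype is float64, which is a guess and is not confirmed by inspecting the pandas source. Near 6.4e17, adjacent float64 values are 128 apart, and the two integers differ by only 64.

```python
import math

a = 635554097106142143  # value stored in the Series
b = 635554097106142079  # value passed to isin

# As exact Python integers the two values are distinct.
print(a == b)  # False

# float64 has a 53-bit significand, so at this magnitude the gap
# between neighbouring floats is 128.
print(math.ulp(float(a)))  # 128.0

# Both integers round to the same float64 and then compare equal.
print(float(a) == float(b))          # True
print(int(float(a)), int(float(b)))  # 635554097106142080 for both
```

If the comparison did go through float64, both values would collapse to 635554097106142080, which would match the True result reported above.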

Expected Behavior

Either avoiding the UInt64 dtype or checking against a plain list instead of a NumPy array produces the expected result.

I.e. both

pd.Series([635554097106142143],dtype="int64").isin(np.array([635554097106142079]))

and

pd.Series([635554097106142143],dtype="UInt64").isin([635554097106142079])

evaluate to False.

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-46-generic
Version : #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 2.0.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 69.0.3
pip : 24.0
Cython : None
pytest : 7.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.18.1
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.10.0
gcsfs : None
matplotlib : 3.9.1
numba : 0.60.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : None
tables : None
tabulate : None
xarray : 2024.1.1
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

rhshadrach commented 1 week ago

Thanks for the report. Note that all numbers here are within int64 limits, and NumPy creates the array here as int64. You also get the expected result by specifying dtype="uint64" when constructing the NumPy array. My guess is that pandas is converting to floats to do the comparison.

When given an array of int64 to test against, we need to handle the difference in dtypes (uint64 vs int64) and the fact that int64 could hold negatives. Perhaps we could strip out the negative values, then convert everything to uint64? I worry this would become a complex operation.

Further investigations and suggestions are welcome.
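As a starting point for discussion, the "strip out the negative values, then convert everything to uint64" idea could be sketched roughly as follows. The helper name `isin_uint64_exact` is hypothetical and this is plain NumPy, not pandas' actual implementation. It ignores masked (NA) values and any dtypes other than signed and unsigned integers.

```python
import numpy as np


def isin_uint64_exact(values, candidates):
    """Hypothetical helper: membership test for uint64 values without a float round trip."""
    values = np.asarray(values, dtype=np.uint64)
    candidates = np.asarray(candidates)

    if candidates.dtype.kind == "i":
        # A negative number can never equal an unsigned value. Drop negatives
        # before casting, because astype(uint64) would wrap -1 to 2**64 - 1.
        candidates = candidates[candidates >= 0]

    # Both sides are now uint64, so the comparison is exact.
    return np.isin(values, candidates.astype(np.uint64))


values = [635554097106142143, 2**64 - 1]
candidates = np.array([635554097106142079, -1], dtype=np.int64)
result = isin_uint64_exact(values, candidates)
print(result)
```

Here neither value should match: the first differs from the candidate by 64, and the second would only match -1 if the negative were allowed to wrap around during the cast. Whether this filtering step is cheap enough for large inputs is an open question.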