pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Incorrect float32/float64 comparison result #59524

Open alexowens90 opened 1 month ago

alexowens90 commented 1 month ago

Reproducible Example

import numpy as np
import pandas as pd

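# Scalar vs. scalar comparison across float widths: this assertion passes.
lhs = np.float32(0)
rhs = np.float64(3.503246160812043 * (10**-46))
assert lhs < rhs

# Same comparison with lhs held in a Series: this assertion fails with numpy 1.26.4.
s = pd.Series([lhs])
assert (s < rhs).iloc[0]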

Issue Description

Comparing the two floating point values (of different widths) using < correctly returns True.

Placing one of the floating point values into a Pandas Series and then running the same comparison incorrectly returns False, i.e. the second assertion fails.

Note that the installed versions below use numpy 1.26.4. The issue is not reproducible with numpy 2.X.
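The difference between the two assertions can be illustrated without pandas. The sketch below assumes (this is not confirmed in the report itself) that the Series path effectively compares in float32. It makes both casts explicit so the result does not depend on the installed NumPy version:

```python
import numpy as np

rhs = np.float64(3.503246160812043 * (10**-46))
arr32 = np.array([0], dtype=np.float32)

# Explicitly downcasting the scalar to float32 rounds it to zero,
# so the comparison becomes 0 < 0.
rhs_as_f32 = np.float32(rhs)
downcast_result = bool((arr32 < rhs_as_f32)[0])

# Upcasting the array to float64 instead keeps the tiny positive value.
upcast_result = bool((arr32.astype(np.float64) < rhs)[0])

print(bool(rhs_as_f32 == 0), downcast_result, upcast_result)
# True False True
```

Only the upcast variant matches the scalar-vs-scalar result from the reproducible example.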

Expected Behavior

Both assertions should pass.

Installed Versions

Released version:

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.153.1-microsoft-standard-WSL2
Version : #1 SMP Fri Mar 29 23:14:13 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 59.6.0
pip : 22.0.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Dev version:

INSTALLED VERSIONS
------------------
commit : 0851ac3b00158fd9ab4b66a6cc3b191eaf06c476
python : 3.10.12
python-bits : 64
OS : Linux
OS-release : 5.15.153.1-microsoft-standard-WSL2
Version : #1 SMP Fri Mar 29 23:14:13 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 3.0.0.dev0+1344.g0851ac3b00
numpy : 1.26.4
dateutil : 2.9.0.post0
pip : 22.0.2
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pytz : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

alexowens90 commented 1 month ago

Update:

assert (s == rhs).iloc[0]

passes. Worth noting that 1.401298464324817071e-45 is the smallest positive (subnormal) value representable by a 32-bit float, i.e. 4 times the value from my example, so the value from my example rounds to 0 when cast to float32.
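The arithmetic behind that observation can be checked directly. This is a small sketch using only NumPy. The variable names are illustrative, and `np.nextafter` is used here to obtain the smallest positive float32:

```python
import math
import numpy as np

rhs = np.float64(3.503246160812043 * (10**-46))

# Smallest positive float32 (a subnormal), i.e. 2**-149.
smallest_f32 = np.nextafter(np.float32(0), np.float32(1))

# rhs is roughly a quarter of that value, which is below the halfway
# point to the next representable float32, so it rounds to zero.
ratio = float(smallest_f32) / float(rhs)

print(float(smallest_f32) == 2.0**-149)
print(math.isclose(ratio, 4.0, rel_tol=1e-12))
print(bool(np.float32(rhs) == np.float32(0)))
```

All three checks print True, which is consistent with (s == rhs) passing whenever the comparison is carried out in float32.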

rhshadrach commented 1 month ago

When we do Series - scalar ops, it appears we default to the Series dtype.

ser = pd.Series(np.float32(0))
scalar = np.float64(3.503246160812043 * (10**-46))
print((ser - scalar).dtype)
# float32
print((scalar - ser).dtype)
# float32

Given this, it seems consistent to coerce ser < scalar to the Series dtype as well.
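For callers who need the float64 result regardless of how the Series dtype is treated, one possible workaround is to upcast the Series explicitly before comparing. This is an illustrative sketch and not something proposed in the thread:

```python
import numpy as np
import pandas as pd

ser = pd.Series([0], dtype="float32")
scalar = np.float64(3.503246160812043 * (10**-46))

# Upcast first so that both operands are float64 and no value is
# rounded away before the comparison.
wide = ser.astype(np.float64)
result = wide < scalar

print(wide.dtype, bool(result.iloc[0]))
```

With both operands already float64, the comparison no longer depends on which promotion rule is applied to a mixed float32/float64 pair.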

alexowens90 commented 3 weeks ago

There is some inconsistency with integers:

ser = pd.Series(np.uint8(0))
scalar = np.int8(1)
print((ser - scalar).dtype)
# uint8
print((scalar - ser).dtype)
# uint8
print(ser - scalar)
# 0    255
# dtype: uint8

scalar = np.int8(-1)
print(ser + scalar)
# 0   -1
# dtype: int16
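For comparison, here is what purely dtype-based promotion gives in plain NumPy when both operands are arrays. This sketch only illustrates NumPy's array-array rules and makes no claim about what pandas does internally:

```python
import numpy as np

u8 = np.array([0], dtype=np.uint8)

# uint8 and int8 have no common 8-bit type, so NumPy's type-based
# promotion picks int16.
print(np.result_type(np.uint8, np.int8))  # int16

sub = u8 - np.array([1], dtype=np.int8)   # int16 result
add = u8 + np.array([-1], dtype=np.int8)  # int16 result

# Staying in uint8 instead wraps around modulo 256.
wrapped = u8 - np.array([1], dtype=np.uint8)

print(sub.dtype, sub[0], add.dtype, add[0], wrapped[0])
```

Under these rules, subtracting 1 and adding -1 agree (both yield -1 as int16), whereas the uint8-only computation wraps to 255.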
rhshadrach commented 3 weeks ago

It's not clear to me if this is inconsistent or a special rule for dealing with negative integers. What would a proposal be?

alexowens90 commented 3 weeks ago

Yes, my assumption was that there was an "if scalar < 0" condition somewhere in the type promotion rules. It is odd to me that

pd.Series(np.uint8(0)) - np.int8(1) != pd.Series(np.uint8(0)) + np.int8(-1)

It would make more sense if both gave the same result (that of the RHS).

For context, I work on similar processing operations in ArcticDB, and I was using hypothesis+Pandas to declaratively test ArcticDB when I noticed this.

rhshadrach commented 3 weeks ago

No disagreement that this is thorny - as mentioned above I think we would need a concrete proposal on how to handle promotion across operations and dtypes to move forward.

alexowens90 commented 3 weeks ago

Fair. As the ArcticDB code I linked shows, you can do this manually for the standard numeric types for a small number of supported binary operations, but you need to work out this logic for every operation you support, which I assume would be a bit of a mammoth task with Pandas.
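To give a rough idea of what manual promotion for a single binary operation might look like, here is a minimal sketch. The helper name `promoted_binary_op` is hypothetical, it is not taken from ArcticDB or pandas, and it assumes the scalar is a NumPy-typed scalar (a plain Python int would be treated as the platform default integer):

```python
import operator
import numpy as np

def promoted_binary_op(values, scalar, op):
    """Hypothetical helper: promote both operands by dtype, then apply op."""
    values = np.asarray(values)
    scalar_arr = np.asarray(scalar)
    # Promotion is decided from the dtypes only, never from the values.
    common = np.result_type(values.dtype, scalar_arr.dtype)
    return op(values.astype(common), scalar_arr.astype(common))

u8 = np.array([0], dtype=np.uint8)
sub = promoted_binary_op(u8, np.int8(1), operator.sub)
add = promoted_binary_op(u8, np.int8(-1), operator.add)

f32 = np.array([0], dtype=np.float32)
tiny = np.float64(3.503246160812043 * (10**-46))
lt = promoted_binary_op(f32, tiny, operator.lt)

print(sub, add, lt)
```

This covers only operations where promoting both inputs up front is the right rule. Operations such as division or shifts would need their own logic, which is the per-operation work described above.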