pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: [pyarrow] Bizarre overflow error when subtracting two identical `Index` objects. #59082

Open randolf-scholz opened 1 week ago

randolf-scholz commented 1 week ago

Reproducible Example

import pandas as pd

datetimes = [None, "2022-01-01T10:00:30", "2022-01-01T10:01:00"]
dt = pd.Index(datetimes, dtype="timestamp[ms][pyarrow]")

offset = pd.Timestamp("2022-01-01 10:00:30")
unit = pd.Index([pd.Timedelta(30, "s")], dtype="duration[ms][pyarrow]").item()

# %% encode to double[pyarrow]
encoded = (dt - offset) / unit
decoded = (encoded.round().astype(float) * unit + offset).astype(dt.dtype)

# compare original and decoded
pd.testing.assert_index_equal(dt, decoded, exact=True)  # ✅
assert ((dt - dt) == 0).all()  # ✅
assert ((decoded - decoded) == 0).all()  # ✅
assert ((decoded - dt) == 0).all()  # ✅
assert ((dt - decoded) == 0).all()  # ❌ overflow ?!?!

Issue Description

This one is absolutely baffling to me. Two Index objects that satisfy assert_index_equal nevertheless raise an exception when one is subtracted from the other. I guess assert_index_equal must be ignoring some internal differences under the hood?

It occurs after encoding a timestamp[ms][pyarrow] index to float and decoding it back to timestamp[ms][pyarrow], by subtracting an offset and dividing by some frequency.

Even weirder, it makes a difference how we define the unit:

# %% Try with different units
unit_a = pd.Timedelta(30, "s")  # <-- this one works!
unit_b = pd.Index([pd.Timedelta(30, "s")], dtype="duration[ms][pyarrow]").item()
assert unit_a == unit_b  # ✅
assert hash(unit_a) == hash(unit_b)  # ✅

# encode to double[pyarrow]
encoded_a = (dt - offset) / unit_a
encoded_b = (dt - offset) / unit_b
pd.testing.assert_index_equal(encoded_a, encoded_b, exact=True)  # ✅

# decode
decoded_a = (encoded_a.round().astype(float) * unit_a + offset).astype(dt.dtype)
decoded_b = (encoded_b.round().astype(float) * unit_b + offset).astype(dt.dtype)
pd.testing.assert_index_equal(decoded_a, decoded_b, exact=True)  # ✅
pd.testing.assert_index_equal(dt, decoded_a, exact=True)  # ✅
pd.testing.assert_index_equal(dt, decoded_b, exact=True)  # ✅
assert ((dt - decoded_a) == 0).all()  # ✅
assert ((dt - decoded_b) == 0).all()  # ❌ overflow ?!?!

Moreover, manually computing the difference in pyarrow works as well:

# %% compute differences in pyarrow:
import pyarrow as pa

x = pa.Array.from_pandas(dt)
y = pa.Array.from_pandas(decoded)
pa.compute.subtract(x, y)  # ✅
pa.compute.subtract(y, x)  # ✅

Expected Behavior

Either assert_index_equal should show some discrepancy, or dt and decoded should behave interchangeably.

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.7.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-41-generic
Version : #41~22.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 3 11:32:55 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 2.0.0
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 70.1.0
pip : 24.1
Cython : None
pytest : 8.2.2
hypothesis : 6.103.5
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.0
gcsfs : None
matplotlib : 3.9.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.4
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

mroeschke commented 1 week ago

Thanks for the report. pandas uses the _checked variety of the arithmetic ops, so this example also fails in pyarrow, even though I would expect it not to fail:

In [9]: import pyarrow as pa
   ...: 
   ...: x = pa.Array.from_pandas(dt)
   ...: y = pa.Array.from_pandas(decoded)

In [11]: pa.compute.subtract_checked(x, y)
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[11], line 1
----> 1 pa.compute.subtract_checked(x, y)

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/compute.py:246, in _make_generic_wrapper.<locals>.wrapper(memory_pool, *args)
    244 if args and isinstance(args[0], Expression):
    245     return Expression._call(func_name, list(args))
--> 246 return func.call(args, None, memory_pool)

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/_compute.pyx:385, in pyarrow._compute.Function.call()

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: overflow

In [17]: x
Out[17]: 
<pyarrow.lib.TimestampArray object at 0x150422440>
[
  null,
  2022-01-01 10:00:30.000,
  2022-01-01 10:01:00.000
]

In [18]: y
Out[18]: 
<pyarrow.lib.TimestampArray object at 0x151dc7d60>
[
  null,
  2022-01-01 10:00:30.000,
  2022-01-01 10:01:00.000
]
randolf-scholz commented 1 week ago

Here's a simpler repro directly starting from pyarrow, something seems to go wrong during casting:

from datetime import datetime, timedelta
import pyarrow as pa

timestamps = [None, datetime(2022, 1, 1, 10, 0, 30), datetime(2022, 1, 1, 10, 1, 0)]
x = pa.array(timestamps, type=pa.timestamp("ms"))
pa.compute.subtract_checked(x, x)  # ✅

# convert to pandas and back
y = pa.Array.from_pandas(x.to_pandas())
pa.compute.subtract_checked(y, x)  # ✅
pa.compute.subtract_checked(x, y)  # ❌ overflow

I'll cross-post this in the pyarrow repository.