Open randolf-scholz opened 1 week ago
Thanks for the report. pandas uses the _checked variety of the arithmetic ops, so this example also fails in pyarrow, even though I would expect it not to fail:
In [9]: import pyarrow as pa
...:
...: x = pa.Array.from_pandas(dt)
...: y = pa.Array.from_pandas(decoded)
In [11]: pa.compute.subtract_checked(x, y)
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[11], line 1
----> 1 pa.compute.subtract_checked(x, y)
File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/compute.py:246, in _make_generic_wrapper.<locals>.wrapper(memory_pool, *args)
244 if args and isinstance(args[0], Expression):
245 return Expression._call(func_name, list(args))
--> 246 return func.call(args, None, memory_pool)
File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/_compute.pyx:385, in pyarrow._compute.Function.call()
File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()
File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowInvalid: overflow
In [17]: x
Out[17]:
<pyarrow.lib.TimestampArray object at 0x150422440>
[
null,
2022-01-01 10:00:30.000,
2022-01-01 10:01:00.000
]
In [18]: y
Out[18]:
<pyarrow.lib.TimestampArray object at 0x151dc7d60>
[
null,
2022-01-01 10:00:30.000,
2022-01-01 10:01:00.000
]
Here's a simpler repro starting directly from pyarrow. Something seems to go wrong during casting:
from datetime import datetime
import pyarrow as pa
import pyarrow.compute  # ensure the pa.compute submodule is loaded
timestamps = [None, datetime(2022, 1, 1, 10, 0, 30), datetime(2022, 1, 1, 10, 1, 0)]
x = pa.array(timestamps, type=pa.timestamp("ms"))
pa.compute.subtract_checked(x, x) # ✅
# convert to pandas and back
y = pa.Array.from_pandas(x.to_pandas())
pa.compute.subtract_checked(y, x) # ✅
pa.compute.subtract_checked(x, y) # ❌ overflow
I'll cross-post this to the pyarrow issue tracker.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
This one is absolutely baffling to me. Two Index objects, despite satisfying assert_index_equal, raise an exception when one is subtracted from the other. I guess assert_index_equal must be omitting some internal differences under the hood?
It occurs after encoding a timestamp[ms][pyarrow] index to floating point and decoding it back to timestamp[ms][pyarrow], by subtracting an offset and dividing by some frequency.
Even weirder, it makes a difference how we define the unit:
Moreover, manually computing the difference in pyarrow works as well:
Expected Behavior
Either assert_index_equal should show some discrepancy, or dt and decoded should behave interchangeably.
Installed Versions