pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Using `Series.diff()` on a Series with Periods on Windows shows an overflow error #58320

Open: Dr-Irv opened this issue 6 months ago

Dr-Irv commented 6 months ago

Reproducible Example

>>> import pandas as pd
>>> pd.Series(
...     [pd.Period("2012-1-1", freq="D"), pd.Period("2012-1-2", freq="D")]
... ).diff()
C:\Anaconda3\envs\pandasstubs\lib\site-packages\pandas\core\arrays\datetimelike.py:1306: RuntimeWarning: overflow encountered in scalar multiply
  new_data = np.array([self.freq.base * x for x in new_i8_data])
0      NaT
1    <Day>
dtype: object

Issue Description

This warning appears only on Windows, not on Linux!

Discovered in a pandas-stubs pull request: https://github.com/pandas-dev/pandas-stubs/pull/907/files#r1571049325

Expected Behavior

No warning message

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.9.16.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 68.2.2
pip : 22.3.1
Cython : None
pytest : 8.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : None
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.3
numba : None
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 15.0.2
pyreadstat : 1.2.7
python-calamine : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.13.0
sqlalchemy : 2.0.29
tables : 3.9.1
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

Aloqeely commented 6 months ago

I think what's happening internally is that the values in the PeriodArray are converted to int64 values to perform some calculations, and NaT is represented by a sentinel at the extreme of the int64 range (the smallest int64 value, -2^63, I think). The code then does a multiplication, which overflows for this NaT entry. That doesn't really affect anything, as the value gets masked right after the multiplication, but the warning is quite misleading.
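To illustrate the kind of warning involved, here is a minimal NumPy-only sketch. It does not reproduce the pandas code path: the helper name `multiply_and_collect_warnings` and the factor `2` are hypothetical, chosen only to force an overflow. `INAT` is the int64 minimum, which pandas uses as its NaT sentinel.

```python
import warnings

import numpy as np

# The int64 minimum; pandas stores NaT as this sentinel in int64-backed data.
INAT = np.iinfo(np.int64).min


def multiply_and_collect_warnings(value, factor):
    """Multiply two int64 scalars and return (result, list of warning messages)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = np.int64(value) * np.int64(factor)
    return result, [str(w.message) for w in caught]


# An ordinary value multiplies cleanly.
ok_result, ok_messages = multiply_and_collect_warnings(5, 2)

# The sentinel sits at the edge of the int64 range, so scaling it overflows
# and NumPy emits a RuntimeWarning about the overflow.
nat_result, nat_messages = multiply_and_collect_warnings(INAT, 2)
print(ok_messages, nat_messages)
```

The exact warning text depends on the NumPy version, so the sketch only collects the messages instead of matching a specific string.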

I think we can solve this either by performing the masking before doing the multiplication, or by doing the multiplication only for the non-NaT values.
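A rough sketch of the second option, written against plain NumPy and not against the actual pandas internals: the function name `scale_non_nat` is hypothetical, `base` stands in for `self.freq.base`, and `None` stands in for `NaT` so the example needs no pandas import.

```python
import warnings

import numpy as np

# The int64 minimum; pandas stores NaT as this sentinel in int64-backed data.
INAT = np.iinfo(np.int64).min


def scale_non_nat(i8_values, base):
    """Multiply only the non-sentinel entries by `base`.

    Sentinel positions are filled with None (a stand-in for NaT) and are
    never passed to the multiplication, so they cannot trigger an overflow.
    """
    i8_values = np.asarray(i8_values, dtype=np.int64)
    mask = i8_values == INAT
    result = np.empty(len(i8_values), dtype=object)
    for i, (x, is_nat) in enumerate(zip(i8_values, mask)):
        result[i] = None if is_nat else base * int(x)
    return result


# Turn warnings into errors to confirm that nothing is emitted.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    out = scale_non_nat([INAT, 1, 3], base=np.int64(2))
print(out)
```

The first option (masking before the multiplication) would look similar: replace the sentinel entries with a harmless value such as 0, multiply everything, then overwrite the masked positions afterwards.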

But I don't know why it happens only on Windows.