pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Adding big offset to timedelta generates a python crash #14080

Closed geoffroy-destaintot closed 5 years ago

geoffroy-destaintot commented 8 years ago

Code Sample, a copy-pastable example if possible

In:
import pandas as pd
from pandas.tseries.frequencies import to_offset

d = pd.Timestamp("2000/1/1")
d + to_offset("D")*100**25
Out:

=> python crash

Fatal Python error: Cannot recover from stack overflow.

Current thread 0x00002b00 (most recent call first):
  File "C:\Users\geoffroy.destaintot\Miniconda3\envs\pd-0.18\lib\site-packages\pandas\tseries\offsets.py", line 2526 in delta
  File "C:\Users\geoffroy.destaintot\Miniconda3\envs\pd-0.18\lib\site-packages\pandas\tseries\offsets.py", line 2535 in apply
  File "C:\Users\geoffroy.destaintot\Miniconda3\envs\pd-0.18\lib\site-packages\pandas\tseries\offsets.py", line 2493 in __add__
  File "C:\Users\geoffroy.destaintot\Miniconda3\envs\pd-0.18\lib\site-packages\pandas\tseries\offsets.py", line 390 in __radd__
  File "C:\Users\geoffroy.destaintot\Miniconda3\envs\pd-0.18\lib\site-packages\pandas\tseries\offsets.py", line 2535 in apply
  File "C:\Users\geoffroy.destaintot\Miniconda3\envs\pd-0.18\lib\site-packages\pandas\tseries\offsets.py", line 2493 in __add__
  File "C:\Users\geoffroy.destaintot\Miniconda3\envs\pd-0.18\lib\site-packages\pandas\tseries\offsets.py", line 390 in __radd__
  ...

Expected Output

Satisfactory behaviour when using python timedeltas:

In:
import datetime as dt
import pandas as pd
from pandas.tseries.frequencies import to_offset

d = pd.Timestamp("2000/1/1")
d + dt.timedelta(days=1)*100**25
Out:

=> python error

Traceback (most recent call last):
  File "C:/Users/geoffroy.destaintot/Documents/Local/Informatique/Projets/2016-08-django-debug/to_offset_bug.py", line 11, in <module>
    d + dt.timedelta(days=1)*100**25
OverflowError: Python int too large to convert to C long
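For reference, the overflow in this second snippet comes from the timedelta multiplication itself, before pandas gets involved. A minimal standard-library sketch (the helper name `multiply_or_report` is illustrative, not from the report) shows that it surfaces as an ordinary, catchable exception:

```python
import datetime as dt


def multiply_or_report(delta, factor):
    """Return delta * factor, or a short message if the result does not fit."""
    try:
        return delta * factor
    except OverflowError as err:
        # The exact wording of the message differs between platforms and
        # Python versions, so only the exception type is relied on here.
        return "OverflowError: {}".format(err)


one_day = dt.timedelta(days=1)
small = multiply_or_report(one_day, 100)      # well within timedelta's range
huge = multiply_or_report(one_day, 100**25)   # far outside timedelta's range

print(small)
print(huge)
```

This is the behaviour the report asks for: a Python exception rather than an interpreter crash.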

output of pd.show_versions()

(same behaviour with pandas 0.17.1, 0.16.2, 0.15.2)

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 25.1.6
Cython: None
numpy: 1.11.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

jreback commented 8 years ago

thought we had an issue for this....

It's a wraparound thing, I think.

PRs are welcome.

bhaprayan commented 8 years ago

Any pointers on how to fix this?

jreback commented 8 years ago

step thru the code - this hits cython at some point (for the add) then again for the construction of a new Timestamp - think it's crashing there

bhaprayan commented 8 years ago

I generated the stack trace and stepped through the code. I've isolated the problem to the subset of the trace attached below. It crashes at the point where it tries to multiply "self.n" and "self._inc", within the "delta" property of the Tick class. Any suggestions on fixing this?

> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(393)__radd__()
-> def __radd__(self, other):
(Pdb) s
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(394)__radd__()
-> return self.__add__(other)
(Pdb) s
--Call--
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2698)__add__()
-> def __add__(self, other):
(Pdb) s
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2699)__add__()
-> if isinstance(other, Tick):
(Pdb) s
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2704)__add__()
-> elif isinstance(other, ABCPeriod):
(Pdb) s
--Call--
> /home/bhaprayan/Workspace/pandas/pandas/types/generic.py(7)_check()
-> @classmethod
(Pdb) s
> /home/bhaprayan/Workspace/pandas/pandas/types/generic.py(9)_check()
-> return getattr(inst, attr, '_typ') in comp
(Pdb) s
--Return--
> /home/bhaprayan/Workspace/pandas/pandas/types/generic.py(9)_check()->False
-> return getattr(inst, attr, '_typ') in comp
(Pdb) s
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2706)__add__()
-> try:
(Pdb) s
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2707)__add__()
-> return self.apply(other)
(Pdb) s
--Call--
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2746)apply()
-> def apply(self, other):
(Pdb) s
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2748)apply()
-> if isinstance(other, (datetime, np.datetime64, date)):
(Pdb) s
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2749)apply()
-> return as_timestamp(other) + self
(Pdb) s
--Call--
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(35)as_timestamp()
-> def as_timestamp(obj):
(Pdb) s
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(36)as_timestamp()
-> if isinstance(obj, Timestamp):
(Pdb) s
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(37)as_timestamp()
-> return obj
(Pdb) s
--Return--
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(37)as_timestamp()->Timestam...0:00:00')
-> return obj
(Pdb) s
--Call--
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2738)delta()
-> @property
(Pdb) s
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2740)delta()
-> return self.n * self._inc
(Pdb) s
OverflowError: 'Python int too large to convert to C long'
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2740)delta()
-> return self.n * self._inc
(Pdb) s
--Return--
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2740)delta()->None
-> return self.n * self._inc
(Pdb) s
--Call--
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(393)__radd__()
-> def __radd__(self, other):
(Pdb)

jreback commented 8 years ago

So I think that multiplication needs a guard on overflow:

In [2]: np.iinfo(np.int64).max
Out[2]: 9223372036854775807

In [3]: np.int64(1000000)*np.int64(86400*1e9)
/Users/jreback/miniconda/bin/ipython:1: RuntimeWarning: overflow encountered in long_scalars
  #!/bin/bash /Users/jreback/miniconda/bin/python.app
Out[3]: -5833720368547758080
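A guard of this kind could look roughly like the sketch below. It is written in plain Python and is not the pandas implementation; the names `checked_mul_int64`, `wrap_int64` and `NANOS_PER_DAY` are made up for illustration. Because Python ints have arbitrary precision, the product can be computed exactly and then range-checked against the int64 limits shown above.

```python
INT64_MAX = 2**63 - 1      # same value as np.iinfo(np.int64).max
INT64_MIN = -2**63
NANOS_PER_DAY = 86400 * 10**9


def checked_mul_int64(n, inc):
    """Multiply two ints exactly and raise if the result leaves the int64 range."""
    product = int(n) * int(inc)
    if not INT64_MIN <= product <= INT64_MAX:
        raise OverflowError(
            "{} * {} does not fit in a signed 64-bit integer".format(n, inc)
        )
    return product


def wrap_int64(value):
    """Show what silent two's-complement wraparound would produce instead."""
    return (value + 2**63) % 2**64 - 2**63


# 1000 days in nanoseconds fits; one million days does not.
print(checked_mul_int64(1000, NANOS_PER_DAY))
print(wrap_int64(1000000 * NANOS_PER_DAY))  # the wrapped value from Out[3]
try:
    checked_mul_int64(1000000, NANOS_PER_DAY)
except OverflowError as err:
    print("guard tripped:", err)
```

Checking the exact product up front turns the silent wraparound into an explicit error at the point where the bad value is created.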

bhaprayan commented 8 years ago

First, I set a guard on the multiplication overflow. However, it's still stuck in a recursive loop: after the OverflowError is caught and re-raised, "__radd__" still gets called.

ipdb> s
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2741)delta()
   2739     def delta(self):
   2740         try:
-> 2741             self.n * self._inc
   2742         except OverflowError:
   2743             raise

ipdb> s
OverflowError: 'Python int too large to convert to C long'
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2741)delta()
   2739     def delta(self):
   2740         try:
-> 2741             self.n * self._inc
   2742         except OverflowError:
   2743             raise

ipdb> s
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2742)delta()
   2740         try:
   2741             self.n * self._inc
-> 2742         except OverflowError:
   2743             raise
   2744

ipdb> s
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2743)delta()
   2741             self.n * self._inc
   2742         except OverflowError:
-> 2743             raise
   2744
   2745     @property

ipdb> s
--Return--
None
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(2743)delta()
   2741             self.n * self._inc
   2742         except OverflowError:
-> 2743             raise
   2744
   2745     @property

ipdb> s
--Call--
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(393)__radd__()
    391         return NotImplemented
    392
--> 393     def __radd__(self, other):
    394         return self.__add__(other)
    395

ipdb> s
> /home/bhaprayan/Workspace/pandas/pandas/tseries/offsets.py(394)__radd__()
    392
    393     def __radd__(self, other):
--> 394         return self.__add__(other)
    395
    396     def __sub__(self, other):

guygoldberg commented 7 years ago

Looks like this issue has already been fixed. Running the reproduction scenario now, I get a clear exception: OverflowError: the add operation between <100000000000000000000000000000000000000000000000000 * Days> and 2000-01-01 00:00:00 will overflow

jreback commented 7 years ago

great

Do you want to do a PR with some tests?

dsm054 commented 7 years ago

I put together a quick smoke test, and indeed it looks like things are generating exceptions like they should.

But two offsets, the FY5253Quarter and DateOffset cases, both take forever to fail, ~20s in one case, ~10s in the other, so something's different about them (I haven't given even a cursory glance).

jreback commented 6 years ago

this is already fixed in master if someone would like to add tests in a PR