pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.62k stars 17.58k forks source link

Period overflows with dates > pd.Timestamp.max 2262-04-11 #28104

Open paul-lilley opened 4 years ago

paul-lilley commented 4 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd
g = pd.Period(year=2262, month=4, day=11, freq='D')
print(g.start_time)
g2 = pd.Period(year=2262, month=4, day=12, freq='D')
print(g2.start_time)

Output: (with pandas 0.25.0) 2262-04-11 00:00:00 1677-09-21 00:25:26.290448384

with pandas 0.25.0+216.g1aca08aa6 I get an OutOfBoundsDatetime error

print(g2.start_time) Traceback (most recent call last): File "", line 1, in File "pandas/_libs/tslibs/period.pyx", line 1768, in pandas._libs.tslibs.period._Period.start_time.get File "pandas/_libs/tslibs/period.pyx", line 1811, in pandas._libs.tslibs.period._Period.to_timestamp File "pandas/_libs/tslibs/period.pyx", line 1190, in pandas._libs.tslibs.period.period_ordinal_to_dt64 File "pandas/_libs/tslibs/np_datetime.pyx", line 118, in pandas._libs.tslibs.np_datetime.check_dts_bounds pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-04-12 00:00:00

Problem description

Overflow means that the Period is incorrect, and also hinders use of Period for date/times outside the limits of pd.Timestamp

Expected Output

Output: 2262-04-11 00:00:00 2262-04-12 00:00:00

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.6.6.final.0 python-bits : 64 OS : Windows OS-release : 7 machine : AMD64 processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : None.None pandas : 0.25.0 numpy : 1.16.3 pytz : 2018.9 dateutil : 2.7.5 pip : 19.1.1 setuptools : 40.8.0 Cython : None pytest : 4.2.0 hypothesis : None sphinx : 2.1.2 blosc : None feather : None xlsxwriter : 1.1.2 lxml.etree : 4.3.3 html5lib : None pymysql : None psycopg2 : None jinja2 : 2.10 IPython : None pandas_datareader: None bs4 : 4.7.1 bottleneck : None fastparquet : None gcsfs : None lxml.etree : 4.3.3 matplotlib : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pytables : None s3fs : None scipy : None sqlalchemy : 1.3.1 tables : None xarray : None xlrd : 1.2.0 xlwt : None xlsxwriter : 1.1.2
jbrockmendel commented 4 years ago

pd.Timestamp is only implemented within the bounds Timestamp('1677-09-21 00:12:43.145225') to Timestamp('2262-04-11 23:47:16.854775807'). Outside of that range we could have start_time and end_time return datetime objects I guess

paul-lilley commented 4 years ago

This looks related to https://github.com/pandas-dev/pandas/pull/27916/ but outside my comfort zone to try altering to return python datetime objects for start_time and end_time when the Period is outside the range of pd.Timestamp

TomAugspurger commented 4 years ago

I don't think the return type should depend on the value.

Is there any advantage to returning a Timestamp, rather than always return a datetime? If not, we could deprecate Period.start_time and Period.end_time in favor of a property that always returns a datetime.

On Tue, Aug 27, 2019 at 3:04 PM Paul Lilley notifications@github.com wrote:

This looks related to https://github.com/pandas-dev/pandas/pull/27916/ http://url but outside my comfort zone to try altering to return python datetime objects for start_time and end_time when the Period is outside the range of pd.Timestamp

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/28104?email_source=notifications&email_token=AAKAOIQ6OBZ2GYMDDRAFEB3QGWCEFA5CNFSM4IOZMFA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5I6JWY#issuecomment-525460699, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKAOIRN3CNYW5YMXGZR53TQGWCEFANCNFSM4IOZMFAQ .

paul-lilley commented 4 years ago

I can't really comment on the advantages of returning a Timestamp, but as Period is the recommended means to handle datetimes outside pd.Timestamp (https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#representing-out-of-bounds-spans) it makes more sense (at least to me) that Period methods don't raise exceptions with datetimes outside pd.Timestamp. My limited use case is trying to resolve https://github.com/pandas-dev/pandas/issues/20927, which currently I've done using datetime objects

TomAugspurger commented 4 years ago

but as Period is the recommended means to handle datetimes outside pd.Timestamp

That's mostly in the context of an array of values inside a Series / DataFrame. When we're returning scalar, there's no need to return a Period (which has different semantics to a datetime).

paul-lilley commented 4 years ago

Good point. I agree that it makes sense to deprecate Period.start_time and Period.end_time in favor of a property that always returns a datetime

TomAugspurger commented 4 years ago

Need a name. Perhaps Period.start_datetime?

On Tue, Aug 27, 2019 at 3:32 PM Paul Lilley notifications@github.com wrote:

Good point. I agree that it makes sense to deprecate Period.start_time and Period.end_time in favor of a property that always returns a datetime

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/28104?email_source=notifications&email_token=AAKAOIR3SJQXOX2XL7JW4PDQGWFNHA5CNFSM4IOZMFA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5JAY6I#issuecomment-525470841, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKAOITUL2A7FTGXDAWLQKLQGWFNHANCNFSM4IOZMFAQ .

paul-lilley commented 4 years ago

Sounds good, Period.start_datetime and Period.end_datetime

jbrockmendel commented 4 years ago

this will still run up against the datetime implementation bounds. do we care about those?

paul-lilley commented 4 years ago

Fair point, though at least the datetime.MINYEAR to datetime.MAXYEAR range covers the typical 'max-date' values (9999-12-31 in DB2 and Teradata, 31DEC9999 in SAS) that I've encountered in datawarehousing tables used to implement type 2 slowly changing dimensions with start_date/end_date. Often the currently valid row is marked by end_date = some (arbitrarily high) 'max-date' to avoid nulls in the column.

TomAugspurger commented 4 years ago

I personally don't care about representing date times that python can't.

On Tue, Aug 27, 2019 at 11:21 PM Paul Lilley notifications@github.com wrote:

Fair point, though at least the datetime.MINYEAR to datetime.MAXYEAR range covers the typical 'max-date' values (9999-12-31 in DB2 and Teradata, 31DEC9999 in SAS) that I've encountered in datawarehousing tables used to implement type 2 slowly changing dimensions with start_date/end_date. Often the currently valid row is marked by end_date = some (arbitrarily high) 'max-date' to avoid nulls in the column.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/28104?email_source=notifications&email_token=AAKAOITELKKKPYYPRVK4NX3QGX4M3A5CNFSM4IOZMFA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5J25XA#issuecomment-525577948, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKAOIQDVHANQROUJPAUR6LQGX4M3ANCNFSM4IOZMFAQ .

MinKimIP commented 4 years ago

Hi guys, just to add: when reading SQL results from cx_oracle to pandas, rather than throwing an out of range error, the returned pandas dataframe converts datetimes from 9999-12-31 to a particular timestamp in the year 1822, and 2999-12-31 to a particular timestamp in the year 1830. Basically restarting from 1677 for dates beyond 2262. Shouldn't pandas ideally throw an out of bounds error or return NaT for these dates?

jbrockmendel commented 1 year ago

now that we have non-nano Timestamp support we could handle this