pandas-dev / pandas

to_datetime cannot convert dates larger than 2262-04-11T00:00:00.000000000 #21972

Closed: paul-lilley closed this issue 5 years ago

paul-lilley commented 5 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd
import datetime

d0 = datetime.date(1970, 1, 1)  # Python/Unix epoch
d1 = datetime.date(9999, 12, 31)  # oft-used max date in SAS / DB2

for days in range(106752, (d1 - d0).days):  # (d1 - d0).days == 2932896
    try:
        data = {'days_from_python_epoch': [days]}
        df = pd.DataFrame(data)
        df['date_as_date'] = pd.to_datetime(df['days_from_python_epoch'], unit='d', origin='1970-01-01')
    except pd.errors.OutOfBoundsDatetime as e:  # public alias of _libs.tslibs.np_datetime.OutOfBoundsDatetime
        print(f'exception with {days}')
        z = d0 + datetime.timedelta(days=days - 1)
        print(f'looks like {days} - 1 -> {z} is the max possible')  # 2262-04-11 is the max possible
        raise e  # the very first value, 106752, already throws

Problem description

This causes imports of some SAS datasets to fail, and potentially imports/conversions from other sources (such as DB2) where an arbitrarily high date (often 9999-12-31) is used in data warehousing to mark currently valid rows with to_date = 9999-12-31.

See related issue #20927 (reposted here because the problem is more generic than SAS dataset imports) and possibly #15836.

I'm pretty sure the root cause is that to_datetime() converts via nanoseconds (in _libs.tslibs.timedeltas.cast_from_unit), and very large dates expressed in nanoseconds do not fit into int64 space.
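
For illustration, a minimal sketch of the arithmetic behind that limit (my own working, not code from the report): datetime64[ns] values are signed 64-bit nanosecond counts since the epoch, which caps the representable range at pd.Timestamp.max.

import datetime
import pandas as pd

INT64_MAX = 2**63 - 1               # datetime64[ns] is a signed 64-bit count
NS_PER_DAY = 24 * 60 * 60 * 10**9   # nanoseconds per day
max_days = INT64_MAX // NS_PER_DAY  # whole days representable in nanoseconds
print(max_days)  # 106751
print(datetime.date(1970, 1, 1) + datetime.timedelta(days=max_days))  # 2262-04-11
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

This is why day 106751 converts but day 106752 raises OutOfBoundsDatetime in the loop above.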

Expected Output

Correctly parsed dates up to and including 9999-12-31.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.1
pytest: 3.6.2
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.3
numpy: 1.14.5
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.5
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.4
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.5
lxml: 4.2.2
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.8
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
jschendel commented 5 years ago

This is the expected behavior. See the Timestamp Limitations section of the documentation, as well as the Representing Out-of-Bounds Spans section for a workaround using period_range/PeriodIndex.
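
For illustration, a minimal sketch of that workaround applied to this issue's day counts (my own adaptation of the docs' approach, not code from the thread). It assumes the day offsets fit within Python's year-9999 date limit:

import pandas as pd
import datetime

d0 = datetime.date(1970, 1, 1)
day_counts = [106752, 2932896]  # offsets that overflow datetime64[ns]

# datetime.date covers years 1..9999, and daily Periods are stored as
# int64 day ordinals, so neither step hits the nanosecond limit.
periods = pd.Series([pd.Period(d0 + datetime.timedelta(days=n), freq='D')
                     for n in day_counts])
print(periods)
# 0    2262-04-12
# 1    9999-12-31
# dtype: period[D]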

paul-lilley commented 5 years ago

Thanks - the Representing Out-of-Bounds Spans section gives a good hint as to how issue #20927 might be resolved.