pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: large datetimes incorrect in read_sas #56014

Open jbrockmendel opened 1 year ago

jbrockmendel commented 1 year ago

There are a few tests in tests.io.sas.test_sas7bdat that need attention from someone with access to SAS.

def test_max_sas_date(datapath):
    # GH 20927
    # NB. max datetime in SAS dataset is 31DEC9999:23:59:59.999
    #    but this is read as 29DEC9999:23:59:59.998993 by a buggy
    #    sas7bdat module

We need to read the file "pandas/tests/io/sas/data/max_sas_date.sas7bdat" with SAS and inspect the first entry in the dt_as_dt column to confirm the expected behavior. It is also unclear what precision the result should have: the comment suggests milliseconds, we get microseconds, and pyreadstat rounds to seconds (xref https://github.com/WizardMac/ReadStat/issues/306)
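As a cross-check on the float side, the max SAS datetime can be reproduced with stdlib epoch arithmetic (a quick sketch; the 253717747199.999 figure comes from the SAS VIEWTABLE output later in this thread):

    import math
    from datetime import datetime

    # SAS datetimes are float64 counts of seconds since the SAS epoch, 1960-01-01.
    SAS_EPOCH = datetime(1960, 1, 1)

    # Max SAS datetime: 31DEC9999:23:59:59.999
    max_dt = datetime(9999, 12, 31, 23, 59, 59, 999000)
    sas_seconds = (max_dt - SAS_EPOCH).total_seconds()
    print(sas_seconds)  # matches SAS's 253717747199.999 to float64 precision

    # float64 spacing near this magnitude is ~30 microseconds, so millisecond
    # precision is representable here, but nanosecond precision is not.
    print(math.ulp(sas_seconds))  # → 3.0517578125e-05
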

def test_date_time(datapath):
    [...]
    # GH 19732: Timestamps imported from sas will incur floating point errors

Similarly, we need to read the file "pandas/tests/io/sas/data/datetime.sas7bdat" with SAS and inspect the first and last entries in the "DateTimeHi" column to ensure we are getting these right.
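The floating point errors mentioned in GH 19732 are inherent to the on-disk format: sas7bdat stores datetimes as float64 seconds relative to 1960-01-01, and a short sketch shows that a round trip through that representation cannot preserve nanosecond-level digits at these magnitudes (the sample timestamp is the first DateTimeHi entry reported later in this thread):

    import math
    import pandas as pd

    # ~300 years from the epoch is ~1e10 seconds; adjacent float64 values
    # there differ by about 2 microseconds.
    print(math.ulp(1e10))  # → 1.9073486328125e-06

    # Round-tripping a nanosecond timestamp through float64 SAS seconds
    # therefore loses the sub-microsecond digits.
    epoch = pd.Timestamp("1960-01-01")
    ts = pd.Timestamp("1677-09-21 00:12:43.145225525")
    sas_val = (ts - epoch).total_seconds()          # float64: precision lost here
    back = epoch + pd.to_timedelta(sas_val, unit="s")
    print(back)  # close to ts, but not bit-for-bit equal
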

ChadFulton commented 1 year ago

Opening max_sas_date.sas7bdat in "SAS 9.4 TS Level 1M6" in Linux yields the following table (in "SAS:VIEWTABLE" in "Browse" mode):

   text    dt_as_float       dt_as_dt                date_as_float  date_as_date
1  max     253717747199.999  31DEC9999:23:59:59.999  2936547        9999-12-31
2  normal  1880323199.999    01AUG2019:23:59:59.999  21762          2019-08-01
ChadFulton commented 1 year ago

And the datetime.sas7bdat shows:

   Date1       Date2    DateTime            DateTimeHi                    Taiw
1  1677-09-22  22SEP77  21SEP1677:00:12:44  21SEP1677:00:12:43.145225525  0001/01/01
2  1960-01-01  01JAN60  01JAN1960:00:00:00  01JAN1960:00:00:00.000000000  0049/01/01
3  2016-02-29  29FEB16  29FEB2016:23:59:59  29FEB2016:23:59:59.123456001  0105/02/29
4  2262-04-11  11APR62  11APR2262:23:47:16  11APR2262:23:47:16.854774475  0351/04/11
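For context, the first and last DateTimeHi values above sit about 1.3 microseconds inside the bounds of pandas' nanosecond Timestamp, which is presumably why these rows were chosen to exercise the edge cases. A quick check:

    import pandas as pd

    # Bounds of the nanosecond-resolution (int64) Timestamp in pandas:
    print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
    print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
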
jbrockmendel commented 1 year ago

Thanks @ChadFulton, I'll update the comments in the file to point back here as the source of truth.

gtumuluri commented 2 months ago

I am also hitting this error: read_sas simply fails outright, and there is no option to read the file and salvage the good data.

>>> df = pd.read_sas('my_sas_file.sas7bdat')

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 83, in _convert_datetimes
    return pd.to_datetime(sas_datetimes, unit=unit, origin="1960-01-01")
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/tools/datetimes.py", line 1068, in to_datetime
    values = convert_listlike(arg._values, format)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/tools/datetimes.py", line 393, in _convert_listlike_datetimes
    return _to_datetime_with_unit(arg, unit, name, tz, errors)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/tools/datetimes.py", line 557, in _to_datetime_with_unit
    arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
  File "pandas/_libs/tslib.pyx", line 312, in pandas._libs.tslib.array_with_unit_to_datetime
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: cannot convert input with unit 's'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sasreader.py", line 175, in read_sas
    return reader.read()
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 742, in read
    rslt = self._chunk_to_dataframe()
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 792, in _chunk_to_dataframe
    rslt[name] = _convert_datetimes(rslt[name], "s")
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 85, in _convert_datetimes
    s_series = sas_datetimes.apply(_parse_datetime, unit=unit)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/series.py", line 4771, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/apply.py", line 1123, in apply
    return self.apply_standard()
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/apply.py", line 1174, in apply_standard
    mapped = lib.map_infer(
  File "pandas/_libs/lib.pyx", line 2924, in pandas._libs.lib.map_infer
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/apply.py", line 142, in f
    return func(x, *args, **kwargs)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 55, in _parse_datetime
    return datetime(1960, 1, 1) + timedelta(seconds=sas_datetime)
OverflowError: days=-1176508800; must have magnitude <= 999999999
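One possible mitigation (a sketch of a user-side workaround, not current read_sas behavior) is to convert the raw SAS float seconds with errors="coerce", so out-of-range values become NaT instead of raising. The days=-1176508800 in the traceback corresponds to roughly -1.0e14 seconds:

    import pandas as pd

    # One sane SAS datetime and one wildly out-of-range value like those in
    # the file (the traceback's days=-1176508800 is about -1.0e14 seconds).
    sas_seconds = pd.Series([1880323199.999, -1176508800 * 86400.0])
    converted = pd.to_datetime(
        sas_seconds, unit="s", origin="1960-01-01", errors="coerce"
    )
    print(converted)
    # the in-range value converts to ~2019-08-01 23:59:59.999; the
    # out-of-range one becomes NaT rather than raising OverflowError
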

Reading the same file in R using the haven package works just fine and actually shows the bad timestamps clearly.

> library(haven)
> df <- read_sas('my_sas_file.sas7bdat')
> summary(df$DtObgnOrig)
                          Min.                        1st Qu. 
"-3219212-04-24 00:00:00.0000"     "2004-01-27 00:00:00.0000" 
                        Median                           Mean 
    "2008-08-11 00:00:00.0000"     "2009-03-08 10:15:16.7828" 
                       3rd Qu.                           Max. 
    "2014-12-09 00:00:00.0000"     "2027-06-12 00:00:00.0000" 
                          NA's 
                    "93215826"