Open jbrockmendel opened 1 year ago
Opening max_sas_date.sas7bdat
in "SAS 9.4 TS Level 1M6" in Linux yields the following table (in "SAS:VIEWTABLE" in "Browse" mode):
text | dt_as_float | dt_as_dt | date_as_float | date_as_date | |
---|---|---|---|---|---|
1 | max | 253717747199.999 | 31DEC9999:23:59:59.999 | 2936547 | 9999-12-31 |
2 | normal | 1880323199.999 | 01AUG2019:23:59:59.999 | 21762 | 2019-08-01 |
And the datetime.sas7bdat
shows:
Date1 | Date2 | DateTime | DateTimeHi | Taiw | |
---|---|---|---|---|---|
1 | 1677-09-22 | 22SEP77 | 21SEP1677:00:12:44 | 21SEP1677:00:12:43.145225525 | 0001/01/01 |
2 | 1960-01-01 | 01JAN60 | 01JAN1960:00:00:00 | 01JAN1960:00:00:00.000000000 | 0049/01/01 |
3 | 2016-02-29 | 29FEB16 | 29FEB2016:23:59:59 | 29FEB2016:23:59:59.123456001 | 0105/02/29 |
4 | 2262-04-11 | 11APR62 | 11APR2262:23:47:16 | 11APR2262:23:47:16.854774475 | 0351/04/11 |
Thanks @ChadFulton, i'll update the comments in the file to point back here as the Source Of Truth.
I am also having this error where the read_sas
simply fails and there is no other option whatsoever to read and salvage the file / good data.
>>> df = pd.read_sas('my_sas_file.sas7bdat')
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 83, in _convert_datetimes
return pd.to_datetime(sas_datetimes, unit=unit, origin="1960-01-01")
File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/tools/datetimes.py", line 1068, in to_datetime
values = convert_listlike(arg._values, format)
File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/tools/datetimes.py", line 393, in _convert_listlike_datetimes
return _to_datetime_with_unit(arg, unit, name, tz, errors)
File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/tools/datetimes.py", line 557, in _to_datetime_with_unit
arr, tz_parsed = tslib.array_with_unit_to_datetime(arg, unit, errors=errors)
File "pandas/_libs/tslib.pyx", line 312, in pandas._libs.tslib.array_with_unit_to_datetime
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: cannot convert input with unit 's'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/anaconda3/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sasreader.py", line 175, in read_sas
return reader.read()
File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 742, in read
rslt = self._chunk_to_dataframe()
File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 792, in _chunk_to_dataframe
rslt[name] = _convert_datetimes(rslt[name], "s")
File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 85, in _convert_datetimes
s_series = sas_datetimes.apply(_parse_datetime, unit=unit)
File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/series.py", line 4771, in apply
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/apply.py", line 1123, in apply
return self.apply_standard()
File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/apply.py", line 1174, in apply_standard
mapped = lib.map_infer(
File "pandas/_libs/lib.pyx", line 2924, in pandas._libs.lib.map_infer
File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/apply.py", line 142, in f
return func(x, *args, **kwargs)
File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 55, in _parse_datetime
return datetime(1960, 1, 1) + timedelta(seconds=sas_datetime)
OverflowError: days=-1176508800; must have magnitude <= 999999999
Reading the same file in R using the haven
package works just fine and actually shows the bad timestamps clearly.
> library(haven)
> df <- read_sas('my_sas_file.sas7bdat')
> summary(df$DtObgnOrig)
Min. 1st Qu.
"-3219212-04-24 00:00:00.0000" "2004-01-27 00:00:00.0000"
Median Mean
"2008-08-11 00:00:00.0000" "2009-03-08 10:15:16.7828"
3rd Qu. Max.
"2014-12-09 00:00:00.0000" "2027-06-12 00:00:00.0000"
NA's
"93215826"
There are a few tests in tests.io.sas.test_sas7bdat that need attention from someone with access to SAS.
We need to read the file "pandas/tests/io/sas/data/max_sas_date.sas7bdat" with SAS and inspect the first entry in the dt_as_dt column to confirm the expected behavior. It is also unclear what precision the result should have: the comment suggests milliseconds, we get microseconds, and pyreadstat rounds to seconds (xref https://github.com/WizardMac/ReadStat/issues/306)
Similarly we need to read the file "pandas/tests/io/sas/data/datetime.sas7bdat" with SAS and inspect the first and last entries in the "DateTimeHi" column to ensure we are getting these right.