sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy

coerce OutOfBoundsDatetime into 2199 #602

Open rainermensing opened 1 month ago

rainermensing commented 1 month ago

The problem is that the sasdata2dataframe method currently involves a step where timestamps are silently coerced using df[dvarlist[i]] = pd.to_datetime(df[dvarlist[i]], errors='coerce'). Pandas cannot handle timestamps beyond a certain size ('2262-04-11 23:47:16.854775807'); SAS, however, does support larger timestamps. This means all timestamps beyond this limit are silently coerced into NaT within pandas. Such values are not uncommon, e.g. 9999-12-31 as a default 'valid_to' value.
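
The silent coercion is easy to reproduce in isolation (a minimal sketch, not saspy code):

```python
import pandas as pd

# '9999-12-31' is a perfectly valid SAS datetime value, but it exceeds
# pandas' nanosecond-resolution maximum of 2262-04-11 23:47:16.854775807,
# so errors='coerce' silently turns it into NaT.
s = pd.Series(['2024-01-01', '9999-12-31'])
out = pd.to_datetime(s, errors='coerce')
print(out)  # the second value is NaT, with no error or warning
```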

My proposal is to change the timestamp logic in all relevant methods along these lines:

import pandas as pd

# Sample configuration and date string
coerce_into_2199_ts = False
timestamp_format = 'mixed'  # avoids warnings about an unspecified timestamp format
date_string = '9999-01-01'  # vs None vs '2000-01-01'

# Create a DataFrame with the sample timestamp
df = pd.DataFrame({'timestamp': [date_string]})

if coerce_into_2199_ts:
    # Replace out-of-bounds timestamps with 2199-12-31, while preserving
    # values that were already null before parsing.
    # credit: https://stackoverflow.com/questions/77690640/pandas-to-datetime-when-dates-are-9999-01-01
    df["timestamp2"] = (pd.to_datetime(df["timestamp"], errors='coerce', format=timestamp_format)
                        .mask(lambda x: x.isna() & df["timestamp"].notna(),
                              pd.Timestamp('2199-12-31')))
else:
    try:
        df["timestamp2"] = pd.to_datetime(df["timestamp"], format=timestamp_format)
    except pd.errors.OutOfBoundsDatetime:
        raise pd.errors.OutOfBoundsDatetime(
            "The column contains timestamps that are too large. "
            "Consider using coerce_into_2199_ts = True")

The default of coerce_into_2199_ts would be False, so that the coercion is no longer silent. At the same time, the fact that a value exceeded the maximum timestamp should be preserved, as well as any pre-existing NaT values in the column. Hence, e.g. for sasioiom.py:

               if k_dts is None:  # don't override these if user provided their own dtypes
                  for i in range(nvars):
                     if vartype[i] == 'N':
                        if varcat[i] in self._sb.sas_date_fmts + self._sb.sas_time_fmts + self._sb.sas_datetime_fmts:
                           if coerce_into_2199_ts:
                              # Replace out-of-bounds timestamps with 2199-12-31, while
                              # preserving values that were already null before parsing.
                              # credit: https://stackoverflow.com/questions/77690640/pandas-to-datetime-when-dates-are-9999-01-01
                              df[dvarlist[i]] = (pd.to_datetime(df[dvarlist[i]], errors='coerce', format=timestamp_format)
                                                   .mask(lambda x: x.isna() & df[dvarlist[i]].notna(),
                                                         pd.Timestamp('2199-12-31')))
                           else:
                              try:
                                 df[dvarlist[i]] = pd.to_datetime(df[dvarlist[i]], format=timestamp_format)
                              except pd.errors.OutOfBoundsDatetime:
                                 raise pd.errors.OutOfBoundsDatetime(
                                    "The column contains timestamps that are too large. "
                                    "Consider using coerce_into_2199_ts = True")
rainermensing commented 1 month ago

Actually, I just realized that the ideal solution would be to do the preprocessing of the date/timestamp columns in SAS already, i.e. take all date and timestamp columns and replace too-large values with 2199-12-31. Then you would not need to use coerce at all. For sasdata2parquet, the best solution would be to stop relying on pandas and go straight to pyarrow, but that comes with its own pitfalls, as we have seen.

tomweber-sas commented 1 month ago

Hey @rainermensing let me take a look at this when I get a chance. It would be good to have an option to address this.

tomweber-sas commented 2 weeks ago

Hey @rainermensing , I have to apologize for taking so long to get to this. I've been swamped w/ internal SAS work, but I'm looking at this now. I hope to have this and the parquet support from the other issue finished up this week. Did you still have some changes for that other parquet issue to see about integrating? Or is it good as it is?

Thanks! Tom

rainermensing commented 2 weeks ago

Hi @tomweber-sas . Thank you for getting back to this. As for this issue, one thing I did not consider when opening it is that out-of-bounds values can be very large positive AND negative dates. So you don't want to cast both of these to the same value, but to an upper and a lower value.
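
That two-sided clamping could look something like this (a sketch with hypothetical bound values, not saspy's actual implementation; it assumes values coerced to NaT are otherwise-valid ISO date strings):

```python
import pandas as pd
from datetime import datetime

UPPER = pd.Timestamp('2199-12-31')  # hypothetical replacement bounds
LOWER = pd.Timestamp('1700-01-01')

def clamp_out_of_bounds(col: pd.Series) -> pd.Series:
    parsed = pd.to_datetime(col, errors='coerce')
    # values that parsing coerced to NaT but that were not null to begin with
    oob = parsed.isna() & col.notna()

    def side(v):
        # Decide which end of the pandas range the value fell off of:
        # years >= 2262 overflow the high end, everything else the low end.
        return UPPER if datetime.fromisoformat(str(v)).year >= 2262 else LOWER

    parsed[oob] = col[oob].map(side)
    return parsed
```

Nulls that were already present stay NaT, and in-bounds values pass through untouched.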

For the parquet issue, I can only test our full workflow during weekends, and we discovered one more bug last week that I hope I have now finally fixed (and that there won't be anything else). I will create a new merge request with the final logic sometime this week or next, so please be patient... I hope this will not cause you any issues.

tomweber-sas commented 2 weeks ago

Yes, I started playing with the code sample you provided, and I'm thinking of adding a couple of options to sd2df for max/min timestamp values. If either or both are provided, apply them (I'm going to look at doing it in SAS code and in the Python code, to see which works better). That way users can provide the replacement values they want, and without specifying either, the behavior remains the same so as not to break anyone. I hope to finish looking at that tomorrow.

And yes, as for the parquet, once you have those last changes, I can try them out and see about finishing that up too. No problem with any of this, that's what I'm here for :)

Thanks, Tom