sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy

Issues with time conversion when writing out sas datasets with saspy. #279

Closed biojerm closed 4 years ago

biojerm commented 4 years ago

Describe the bug I am reading in datasets with columns that have dates/times in 'funny' formats. The first column, filedate, has dates like 11MAR2016. The second column, filetime, contains only the time, no date, like 16:34:04. When I read the dataset in with saspy, convert it to a dataframe, and then convert it back into a sas7bdat dataset, the filetime has changed. Instead of remaining a count up from POSIX time, it is converted to today's date at the same time. So before it was 01/01/1970 16:45:32; it becomes 1/14/2020 16:45:32 (today's date).

Expected behavior I would expect the time to remain a count up from POSIX time and not have today's date added to the timestamp.

Screenshots

Before (with formats / without formats): [screenshots]

After (with formats / without formats): [screenshots]

I can provide sample datasets for troubleshooting, if desired.


tomweber-sas commented 4 years ago

Yes, this is a known issue that I don't have a good solution for. Your SAS formats aren't really 'funny'. SAS has three distinct format types: date, time, and datetime (timestamp). Pandas dataframes have datetime64 (timestamp), but no individual time or date types. SAS doesn't actually have these three as real data types either; they are all just numbers (floating point double), same as any numeric in SAS. But the format says how to interpret the number: number of seconds since midnight (time), number of days since 01JAN1960 (date), or number of seconds since midnight, 01JAN1960 (datetime).

When I import a SAS dataset into a dataframe, I check for any of these formats so I can import these numbers into pandas as the equivalent datetime64[ns]. For datetimes, this matches. For dates or times, the date portion or time portion, which isn't really there, gets populated by default (today or midnight, depending). Also, the original SAS formats are lost, as there's no way to store that in a dataframe. This is still better than importing them as plain numbers, since you can at least get at the correct date part or time part, even if the other half isn't really 'right'.

This is similar to database interactions, where formats can't be round tripped. Most databases do support all three types, though, so at least a default date/time/datetime format can be used, and you can round trip dates as dates, times as times, and datetimes as datetimes. I'd like to have a real solution for this and be able to round trip each of the three types.

Googling "numpy date time datetime" (pandas dataframes are built upon numpy) shows the following, which is why I don't really have a good way to match type to type:

NumPy has no separate date and time objects, just a single datetime64 object to represent a
single moment in time. The datetime module's datetime object has microsecond precision
(one-millionth of a second). NumPy's datetime64 object allows you to set its precision from
hours all the way to attoseconds (10 ^ -18).

If you have any thoughts on how to solve this, I'm totally open to them. SAS has the ability to create a variable that is a date or time from a datetime, so that can be done; I just can't know when to do it on my own. There are functions that can be applied to the datetime to get just the date part or time part. The best thought at the moment is an option on df2sd where you specify the columns and what type you want, maybe a dict like {'col' : 'date' or 'time', ...}; when I create the dataset, I can then generate code to extract only that part from the full datetime I have from the dataframe. That might not be so bad, actually. But it still requires 'you' to tell me what you want, and it doesn't solve the fact that pandas can't hold only a time or a date from SAS's time or date, so the dataframes will still get today or midnight from SAS times or dates.

What are your thoughts?

Thanks, Tom

tomweber-sas commented 4 years ago

@biojerm, I thought the idea of allowing you to specify that you want date or time columns in the SAS dataset that df2sd creates was a good enhancement. I have a new branch with a first pass at this: datetimes.

Here's output from a test. Note that sd2df still can only create datetimes in pandas; I can't fix that, as pandas doesn't have time or date types. But you can get them back to SAS easily.

>>> import saspy
>>> import datetime
>>> import pandas as pd
>>>
>>> #ssh = saspy.SASsession(cfgname='ssh'); ssh
... sas = saspy.SASsession(cfgname='sdssas'); sas
SAS Connection established. Subprocess id is 8623

No encoding value provided. Will try to determine the correct encoding.
Setting encoding to iso8859_15 based upon the SAS session encoding value of latin9.

Access Method         = STDIO
SAS Config name       = sdssas
SAS Config file       = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path             = /sastmp/SAS_work902A000021CB_tom64-5/
SAS Version           = 9.04.01M6D11072018
SASPy Version         = 3.1.9
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = latin9
Python Encoding value = iso8859_15
SAS process Pid value = 8651

>>> #sas = saspy.SASsession(cfgname='iomj'); sas
... #sas = saspy.SASsession(cfgname='itviya'); sas
...
>>>
>>>
>>> rows = [[datetime.datetime(1965, 1, 1, 8, 0, 1), 1.0, 1.0, 'a'],
...         [datetime.datetime(1966, 1, 1, 7, 0, 2), 2.0, 2.0, 'b'],
...         [datetime.datetime(1967, 1, 1, 6, 0, 3), 3.0, 3.0, ' '],
...         [datetime.datetime(1968, 1, 1, 5, 0, 4), 4.0, 4.0,  ''],
...         [None, 5.0, 5.0, 'b'],
...         [None, 6.0, None, 'b'],
...        ]
>>>
>>> df = pd.DataFrame.from_records(rows, columns=['dt','n1','n2', 's1'])
>>>
>>> sd = sas.df2sd(df, results='text')
>>> #print(sas.saslog().rpartition("data '_df'n;")[2].partition("run;")[0])
... sd.head(99)
                                                           The SAS System                      14:26 Wednesday, January 15, 2020   1

                                        Obs    dt                            n1    n2    s1

                                         1     1965-01-01T08:00:01.000000     1     1    a
                                         2     1966-01-01T07:00:02.000000     2     2    b
                                         3     1967-01-01T06:00:03.000000     3     3
                                         4     1968-01-01T05:00:04.000000     4     4
                                         5                              .     5     5    b
                                         6                              .     6     .    b
>>> sd.columnInfo()

                                                           The SAS System                      14:26 Wednesday, January 15, 2020   2

                                                       The CONTENTS Procedure

                                            Alphabetic List of Variables and Attributes

                                            #    Variable    Type    Len    Format

                                            1    dt          Num       8    E8601DT26.6
                                            2    n1          Num       8
                                            3    n2          Num       8
                                            4    s1          Char      1
>>>
>>> sd = sas.df2sd(df, datetimes={'dt':'date'}, results='text')
>>> #print(sas.saslog().rpartition("data '_df'n;")[2].partition("run;")[0])
... sd.head(99)

                                                           The SAS System                      14:26 Wednesday, January 15, 2020   3

                                                Obs        dt        n1    n2    s1

                                                 1     1965-01-01     1     1    a
                                                 2     1966-01-01     2     2    b
                                                 3     1967-01-01     3     3
                                                 4     1968-01-01     4     4
                                                 5              .     5     5    b
                                                 6              .     6     .    b
>>> sd.columnInfo()

                                                           The SAS System                      14:26 Wednesday, January 15, 2020   4

                                                       The CONTENTS Procedure

                                             Alphabetic List of Variables and Attributes

                                              #    Variable    Type    Len    Format

                                              1    dt          Num       8    E8601DA.
                                              2    n1          Num       8
                                              3    n2          Num       8
                                              4    s1          Char      1
>>>
>>>
>>> sd = sas.df2sd(df, datetimes={'dt':'time'}, results='text')
>>> #print(sas.saslog().rpartition("data '_df'n;")[2].partition("run;")[0])
... sd.head(99)

                                                           The SAS System                      14:26 Wednesday, January 15, 2020   5

                                                 Obs       dt       n1    n2    s1

                                                  1     08:00:01     1     1    a
                                                  2     07:00:02     2     2    b
                                                  3     06:00:03     3     3
                                                  4     05:00:04     4     4
                                                  5            .     5     5    b
                                                  6            .     6     .    b
>>> sd.columnInfo()

                                                           The SAS System                      14:26 Wednesday, January 15, 2020   6

                                                       The CONTENTS Procedure

                                             Alphabetic List of Variables and Attributes

                                              #    Variable    Type    Len    Format

                                              1    dt          Num       8    E8601TM.
                                              2    n1          Num       8
                                              3    n2          Num       8
                                              4    s1          Char      1
>>>
>>>
>>> df = sd.to_df(); df; df.dtypes
                   dt  n1   n2   s1
0 2020-01-15 08:00:01   1  1.0    a
1 2020-01-15 07:00:02   2  2.0    b
2 2020-01-15 06:00:03   3  3.0  NaN
3 2020-01-15 05:00:04   4  4.0  NaN
4                 NaT   5  5.0    b
5                 NaT   6  NaN    b
dt    datetime64[ns]
n1             int64
n2           float64
s1            object
dtype: object

What do you think?

Thanks, Tom

biojerm commented 4 years ago

Hey Tom,

So for me, the main issue is that the underlying values are changing when I import a SAS dataset to a dataframe. For the Date and Datetime formats in SAS, I think the values in the resulting dataframe are fine. However, for my purposes, adding today's date on the front end of any "Time"-formatted value pollutes the original data pretty badly. We have had a number of issues with dates in the past, and in general our solution has been to convert the values to strings/characters, then let the end users convert them back to whatever date/time object/format they want.

Would it be possible to add options, similar to what you did for df2sd, to the to_df method? Something like to_df(timefmt='string'). Maybe have float, and possibly both string/float (I know this would add another column to the df)? There are times when I may not care what the time value is, but I need to pass it through Python without changing it. Having the float option would allow me to do that.

Thanks, Jeremy

tomweber-sas commented 4 years ago

Hey Jeremy, I gotcha. So, we're partly there. I've been investigating possible ways to accomplish the rest of this. It seems there are at least two slightly different (and somewhat conflicting) needs. One is simply being able to round trip a SAS dataset and end up with the same data in SAS afterward. That can be accomplished with the new datetimes={} as shown above (when the pandas columns are datetimes with half being the 'wrong' values: today or midnight).

But the other thing is to have only a date or time in pandas when SAS only had a date or time. Pandas doesn't have those types, so when I use its datetime type with only a SAS date or time, pandas adds in the missing half (today, or midnight), and I can't make it not do that.

Now, that being said, there are ways to convert a pandas datetime to only a date or a time, so that can be done. The caveat is that it changes the dtype to 'object', which won't then round trip on its own back to SAS; it'll become a character column in SAS, holding the formatted date or time string. This can, however, be mitigated by converting it back to a datetime before sending to SAS, and using datetimes={} to strip off the half that pandas put back in.

Also, you can currently get any of the SAS date, time, or datetime columns into the dataframe as plain numerics. That then round trips correctly, but you have to set the format for those columns afterward, to a date, time, or datetime format, for it to be round tripped correctly all the way. And that also can be done.

I have all of these cases 'working' manually, by doing the various steps. So the next thing is to figure out whether I can do all of this in saspy with some non-convoluted options, or whether documenting these patterns so users can do whichever they want themselves in their Python code is more practical. Maybe helper functions for the various steps/conversions, instead of options, would be less convoluted too. So I figure I'll start with documenting these patterns; after doing that, it'll probably be clear whether there's a good way for me to do it or not.

So rather than trying to dump all the code and results here, I think I'll create notebooks for this in the saspy-examples repo: https://github.com/sassoftware/saspy-examples/tree/master/Issue_examples so that we can explore this. With any luck, I'll have something worked up today to post there. You can then try things out and see what you think too.

Thanks! Tom

tomweber-sas commented 4 years ago

Cripes, I thought I posted that last one this morning; I never hit the comment button. Well, that's OK; I now have a notebook out there with this in it. I have two cases: one where you get the dates and times as numerics and can round trip them; the second is getting them as datetimes, but converting them to other objects in the DF that are date only and time only, then converting them back to datetimes to round trip back to SAS if you want to do that.

https://github.com/sassoftware/saspy-examples/blob/master/Issue_examples/Issue279.ipynb

Both cases work, as far as I can see. Let me know if you see anything I missed.

Next, I need to investigate whether I can reasonably implement all of these in my code with just some simple options, or whether providing helper methods to do the conversions, plus examples for each case (simpler and more straightforward with helper funcs than what's in the notebook), would be better. I'll have to play around more and see.

Take a look at the notebook and play around with it to see if it works for all of your situations.

Thanks! Tom

tomweber-sas commented 4 years ago

@biojerm I've enhanced the 'datetimes' branch with everything that's in the Issue279.ipynb notebook referenced above. I've been cleaning it up too. I added outfmts={} to df2sd() so you can specify SAS formats for the output variables. That is something that ought to be there anyway, so you can assign any format to variables when you create data sets from dataframes; it's not limited to date, time, and datetime columns. So, for numeric round tripping, there's nothing you have to do outside of the sd2df and df2sd methods; it's all handled.

For changing datetime64 to date or time in pandas, then changing them back if round tripping, it's just the 2 steps:

import pandas as pd

# datetime64 to datetime.date or datetime.time (dtype becomes 'object');
# NaT values are passed through untouched
df_conv['d'] = df_conv['d'].apply(lambda x: x if pd.isna(x) else x.date())
df_conv['t'] = df_conv['t'].apply(lambda x: x if pd.isna(x) else x.time())

# date or time back to datetime64 (pandas fills in midnight or today)
df_conv['d'] = pd.to_datetime(df_conv['d'].astype('str'), errors='coerce')
df_conv['t'] = pd.to_datetime(df_conv['t'].astype('str'), errors='coerce')

I guess I could make helper routines for these: sas.df_dt2d(), sas.df_dt2t(), and then sas.df_d2dt(), sas.df_t2dt(). Pass them the df and a list of columns of the given type? Or is that something best left to the Python programmer, with this example documented?

I think this is all looking pretty clean and hopefully what you are looking for. Tom

biojerm commented 4 years ago

Hey Tom, this looks really promising. Some stuff came up, so I don't think I will be able to look at this today, but I will check it out next week.

tomweber-sas commented 4 years ago

Hey, no problem. I know how it goes. FWIW, let me provide my thoughts on the conversion case, as to why I'm less inclined to incorporate it completely into saspy.

I COULD just create either date objects or time objects in the DF I return from sd2df, when the SAS formats are only a date or time type. BUT, that results in DF columns simply of type object. In a dataframe, a column of type object can have any actual type for any given row; it doesn't make every row be the same type. The only thing I can do in df2sd with type object is treat it as a string (character) when importing it to SAS, without options or something telling me to do something different.

With an option telling me it's really a date or time type for every row, I would still have to convert all of the values to a datetime to then import them to SAS as date or time values. It's not my place to do those conversions in the dataframe I get passed; I should only be reading it, not manipulating it. And if those columns don't actually convert, every row, to the datetime, then that's a problem with the objects in the dataframe, and I'd rather not have that happen in my code. I'd rather the user, whose dataframe it is, convert datetimes to/from whatever they want them to be for their Python code to work with. Then, if there are conversion issues, it's with their data or code. If they want those columns to be date or time in SAS, just convert them to valid datetime64 (the pandas dtype) and use datetimes={'col' : 'date' or 'time'} on df2sd, and it's all good.

The conversions I am doing above are just one line each, and they convert between datetime64[ns] and datetime.time or datetime.date; that's easy for anyone to code themselves. There could be some other type the user wants them to be, but as long as they can convert to/from the datetime64[ns] type that Pandas supports, they can round trip dates, times, and datetimes to/from SAS.

Hope that makes sense. Tom

biojerm commented 4 years ago

Hey Tom,

After playing around with the new features, I think being able to declare formats with the to_df and df2sd methods is going to work for me. Thanks for implementing these features.

And if those columns don't actually convert, every row, to the datetime, then that's a problem with the objects in the dataframe, and I'd rather not have that happen in my code. I'd rather the user, whose dataframe it is, convert datetimes to/from whatever they want them to be for their Python code to work with. Then, if there are conversion issues, it's with their data or code. If they want those columns to be date or time in SAS, just convert them to valid datetime64 (the pandas dtype) and use datetimes={'col' : 'date' or 'time'} on df2sd, and it's all good.

I completely agree with you. What you have put in place so far is good. Having saspy figure out times would introduce too much complexity, and I think it's fine to have the user sort out their data into the datetime format.

tomweber-sas commented 4 years ago

Cool, thanks. I'll see about merging this into master. Before building a new release with it, though, I'll need to add doc for the options, as well as get working examples of the conversion case into the doc, so users can do that (both cases, really). I think this is a good enhancement: people can have just dates or times in their dataframe, and still have a way to go to/from SAS dates, times, and datetimes without polluting half of the datetime.

Thanks! Tom

xenonspace commented 4 years ago

Would it be possible to have to_df() output some columns, or even all of them, as object/strings? Like suggested earlier: to_df(timefmt='string').

tomweber-sas commented 4 years ago

Hey @xenonspace, the answer is yes, I could add options to let you tell me to create the columns as different types than I would by default. The trouble is that if you then tried to send that dataframe back to SAS, you wouldn't end up with the same data you started with. I try to have my code round trip data so you get what you started with. The one case where this wasn't true is if you had SAS date or time variables (not datetime). Since Pandas doesn't have an equivalent type, I had to store SAS dates and times in Pandas datetimes. When round tripping that, you would get a datetime in SAS, not a date or time. But with this enhancement, you can tell me that the pandas datetime should go to SAS as only a date or time. So now you can actually get the same data back that you started with.

If I added options to have me create the dataframe with different types, then I'd need to add options for the round trip (df2sd) to tell me that the types in the DF aren't what I expect and to convert them to something else so they get back to SAS the way they started. And I CAN do that, but all of this becomes complicated, and problematic, if the actual data in every row of a given column isn't the same correct type to convert to whatever other type you want it to end up being. This is what I was saying in the last post or so.

So, at this point it's kind of 50/50 in my mind whether to go down that path or not. The thing that makes the difference, in my mind, is this: 'you' (whoever's writing the Python program) are getting a dataframe that is as close as possible to the correct representation of the SAS data, and which can be turned back into the same SAS data. In a Python program where you're working with dataframes, it is completely trivial to convert the data in your dataframe to whatever types you would like it to be. And if you want to send a dataframe to SAS, you can make the columns be the expected types so the SAS data set ends up being correct (what you wanted). This is why I am inclined to let the Python programmer manipulate their dataframe themselves however they want, after I create the SAS equivalent, and before they give me a DF to convert to a SAS data set.

So, for instance, say you want all of the columns to be strings instead of the real dtypes. It's one line of Python code; same to change any given column on its own. I have those lines of code in the previous post too, but here are some examples.

>>> cars = sas.sasdata('cars', 'sashelp')
>>> cars.head()
    Make           Model   Type Origin DriveTrain   MSRP  Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0  Acura             MDX    SUV   Asia        All  36945    33337         3.5          6         265        17           23    4451        106     189
1  Acura  RSX Type S 2dr  Sedan   Asia      Front  23820    21761         2.0          4         200        24           31    2778        101     172
2  Acura         TSX 4dr  Sedan   Asia      Front  26990    24647         2.4          4         200        22           29    3230        105     183
3  Acura          TL 4dr  Sedan   Asia      Front  33195    30299         3.2          6         270        20           28    3575        108     186
4  Acura      3.5 RL 4dr  Sedan   Asia      Front  43755    39014         3.5          6         225        18           24    3880        115     197
>>> df_cars = cars.to_df()
>>> df_cars.dtypes
Make            object
Model           object
Type            object
Origin          object
DriveTrain      object
MSRP             int64
Invoice          int64
EngineSize     float64
Cylinders      float64
Horsepower       int64
MPG_City         int64
MPG_Highway      int64
Weight           int64
Wheelbase        int64
Length           int64
dtype: object
>>>
>>>
>>> # now I want a dataframe that is all strings, not regular types
>>> df_cars_str = df_cars.astype(str)
>>> df_cars_str.head()
    Make           Model   Type Origin DriveTrain   MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0  Acura             MDX    SUV   Asia        All  36945   33337        3.5       6.0        265       17          23   4451       106    189
1  Acura  RSX Type S 2dr  Sedan   Asia      Front  23820   21761        2.0       4.0        200       24          31   2778       101    172
2  Acura         TSX 4dr  Sedan   Asia      Front  26990   24647        2.4       4.0        200       22          29   3230       105    183
3  Acura          TL 4dr  Sedan   Asia      Front  33195   30299        3.2       6.0        270       20          28   3575       108    186
4  Acura      3.5 RL 4dr  Sedan   Asia      Front  43755   39014        3.5       6.0        225       18          24   3880       115    197
>>> df_cars_str.dtypes
Make           object
Model          object
Type           object
Origin         object
DriveTrain     object
MSRP           object
Invoice        object
EngineSize     object
Cylinders      object
Horsepower     object
MPG_City       object
MPG_Highway    object
Weight         object
Wheelbase      object
Length         object
dtype: object
>>> 

# or just convert one column to str

>>> df_cars.dtypes
Make            object
Model           object
Type            object
Origin          object
DriveTrain      object
MSRP             int64
Invoice          int64
EngineSize     float64
Cylinders      float64
Horsepower       int64
MPG_City         int64
MPG_Highway      int64
Weight           int64
Wheelbase        int64
Length           int64
dtype: object
>>> df_cars['EngineSize'][:5]
0    3.5
1    2.0
2    2.4
3    3.2
4    3.5
Name: EngineSize, dtype: float64
>>>
>>> df_cars['EngineSize'] = df_cars['EngineSize'].astype(str)
>>> df_cars.dtypes
Make            object
Model           object
Type            object
Origin          object
DriveTrain      object
MSRP             int64
Invoice          int64
EngineSize      object
Cylinders      float64
Horsepower       int64
MPG_City         int64
MPG_Highway      int64
Weight           int64
Wheelbase        int64
Length           int64
dtype: object
>>> df_cars['EngineSize'][:5]
0    3.5
1    2.0
2    2.4
3    3.2
4    3.5
Name: EngineSize, dtype: object
>>>

But if you sent that df_cars_str to df2sd, you would get only character columns, not the correct types. You would then have to provide options to tell me what types those columns are really supposed to be, and I'd have to try to convert them; if there's any error, it's back out of my hands. Only then could I send them to SAS to get the data set. Whereas if the Python programmer converts whatever data they have into the correct pandas dtypes, there's no need for me to be doing that. It's one line of code for the Python programmer, and a bunch of complicated code for me to try to do all of that.

So, that's why I'm not really inclined to go down that path. But I'm not refusing; I'd just have to have a more compelling reason. So, what do you think, and/or what are you trying to do? Is there something problematic about converting/manipulating the data in your dataframe yourself to be whatever you want it to be?

Thanks! Tom

tomweber-sas commented 4 years ago

@xenonspace I've been looking at this a little more, and it turns out that you can already do this; almost :) For the CSV and DISK versions of sd2df, I use the Pandas read_csv() method to import the data. The usual case is that I set the dtypes of the dataframe based upon the SAS types and/or formats. But I also allow kwargs to be passed in on the method, which can be used to pass Pandas-specific options through to read_csv. One pandas option I specifically look for is dtype=; if you specified it, I pass it to read_csv (along with any others) and refrain from setting the dtypes to what I normally would. The one issue there was that I was still converting datetime types afterward, even if you specified dtypes. But that is fixed at master now, so if you set dtype=, I don't set any types myself.

So, if you provide dtype= on the sd2df CSV or DISK invocations, you can control the data types of the dataframe I generate for you. dtype= takes either a single type, which then applies to all columns, or a dict with 'col':'type' values for specific columns. See the Pandas doc here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv
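(Editor's note: the dtype= pass-through behavior can be demonstrated without a SAS session; this is plain pandas on a made-up CSV shaped like sd2df's transfer data, not saspy itself.)

```python
import io
import pandas as pd

csv = "dt,n1\n1965-01-01T08:00:01.000000,1\n1966-01-01T07:00:02.000000,2\n"

# default: pandas infers the types (n1 becomes int64)
df1 = pd.read_csv(io.StringIO(csv))

# a single dtype applies to every column...
df2 = pd.read_csv(io.StringIO(csv), dtype='str')

# ...while a dict targets specific columns only
df3 = pd.read_csv(io.StringIO(csv), dtype={'dt': 'str'})

print(df1.dtypes)
print(df2.dtypes)
print(df3.dtypes)
```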

Here are some examples doing this with the code at master.

>>> rows = [[datetime.datetime(1965, 1, 1, 8, 0, 1), 1.0, 1.0, 'a'],
...         [datetime.datetime(1966, 1, 1, 7, 0, 2), 2.0, 2.0, 'b'],
...         [datetime.datetime(1967, 1, 1, 6, 0, 3), 3.0, 3.0, ' '],
...         [datetime.datetime(1968, 1, 1, 5, 0, 4), 4.0, 4.0,  ''],
...         [None, 5.0, 5.0, 'b'],
...         [None, 6.0, None, 'b'],
...        ]
>>>
>>> df = pd.DataFrame.from_records(rows, columns=['dt','n1','n2', 's1'])
>>>
>>>
>>> df
                   dt   n1   n2 s1
0 1965-01-01 08:00:01  1.0  1.0  a
1 1966-01-01 07:00:02  2.0  2.0  b
2 1967-01-01 06:00:03  3.0  3.0
3 1968-01-01 05:00:04  4.0  4.0
4                 NaT  5.0  5.0  b
5                 NaT  6.0  NaN  b
>>> df.dtypes
dt    datetime64[ns]
n1           float64
n2           float64
s1            object
dtype: object
>>> sd = sas.df2sd(df, results='text')
>>> sd
Libref  = WORK
Table   = _df
Dsopts  = {}
Results = text

>>> sd.head()
                                                           The SAS System                         12:25 Friday, January 24, 2020   1

                                        Obs    dt                            n1    n2    s1

                                         1     1965-01-01T08:00:01.000000     1     1    a
                                         2     1966-01-01T07:00:02.000000     2     2    b
                                         3     1967-01-01T06:00:03.000000     3     3
                                         4     1968-01-01T05:00:04.000000     4     4
                                         5                              .     5     5    b
>>> # here's the default you would get
>>> df = sd.to_df(); df; df.dtypes
                   dt  n1   n2   s1
0 1965-01-01 08:00:01   1  1.0    a
1 1966-01-01 07:00:02   2  2.0    b
2 1967-01-01 06:00:03   3  3.0  NaN
3 1968-01-01 05:00:04   4  4.0  NaN
4                 NaT   5  5.0    b
5                 NaT   6  NaN    b
dt    datetime64[ns]
n1             int64
n2           float64
s1            object
dtype: object
>>> # set all columns to string
>>> dfc = sd.to_df_CSV(dtype='str'); dfc; dfc.dtypes
                           dt n1   n2   s1
0  1965-01-01T08:00:01.000000  1    1    a
1  1966-01-01T07:00:02.000000  2    2    b
2  1967-01-01T06:00:03.000000  3    3  NaN
3  1968-01-01T05:00:04.000000  4    4  NaN
4                         NaN  5    5    b
5                         NaN  6  NaN    b
dt    object
n1    object
n2    object
s1    object
dtype: object
>>> # set only col dt to string, and let pandas default the others to what it thinks.
>>> dfc = sd.to_df_DISK(dtype={'dt':'str'}); dfc; dfc.dtypes
                           dt  n1   n2   s1
0  1965-01-01T08:00:01.000000   1  1.0    a
1  1966-01-01T07:00:02.000000   2  2.0    b
2  1967-01-01T06:00:03.000000   3  3.0  NaN
3  1968-01-01T05:00:04.000000   4  4.0  NaN
4                         NaN   5  5.0    b
5                         NaN   6  NaN    b
dt     object
n1      int64
n2    float64
s1     object
dtype: object
>>>
>>> type(dfc['dt'][0])
<class 'str'>
>>> dfc['dt'][0]
'1965-01-01T08:00:01.000000'

So, what do you think about this? Can you pull master and try it out to see if it works like you want?

Thanks! Tom

xenonspace commented 4 years ago

Hi Tom,

This looks very good and exactly what was suggested earlier.

I tested with one set:

>>> my_data = sas.sasdata("temp", "work", results="text")
>>> my_data.head()
                                                           The SAS System

 Obs    names        date

   1    testd        22.09.2019

>>> df = my_data.to_df_CSV(dtype='str')
>>> df
   names        date
0  testd  2019-09-22

I was hoping that the "date" value would also have come over into df as 22.09.2019?

tomweber-sas commented 4 years ago

Good, yes that lets you control the output types. Hmm, though it's at dataframe creation time. And I adjust the formats for the variables in SAS when transferring the data over; I need to do that so Pandas can import the data correctly. For instance, the only datetime formats I've found that work with Pandas are the SAS E8601xxx formats. SAS has a lot of datetime formats, but the others won't import into Pandas as a datetime. That's why you get a different datetime format in the dataframe than what you have in the SAS output.
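To illustrate why the E8601 formats are the safe transfer choice, here is a minimal pandas-only sketch, separate from saspy (the sample values are made up): ISO 8601 strings parse directly, while other SAS datetime renderings need an explicit format string.

```python
import pandas as pd

# ISO 8601 strings, which the SAS E8601 formats produce, parse directly
iso = pd.Series(['1965-01-01T08:00:01', '1966-01-01T07:00:02'])
dt = pd.to_datetime(iso)

# A DATETIME.-style rendering needs an explicit format string to parse
sas_style = pd.Series(['01JAN1965:08:00:01'])
dt2 = pd.to_datetime(sas_style, format='%d%b%Y:%H:%M:%S')
```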

The way to solve this would be for me to not override the SAS formats. But then the data won't necessarily import into the dataframe correctly, except in the case where the given columns are just to be strings. If you want some columns to be strings, but others to be numeric or some kind of datetime, and the SAS format doesn't correspond to something Pandas can import as that type, it won't work.

You can already override the existing SAS formats by using dsopts={} and specifying what format you want for any given variable. But my code will still assess those formats when generating the code to transfer the data over; again, to get valid values that import. So, if I added an option that simply told me not to override the formats, then you could control the formatting of the data being transferred over, and control how Pandas imports it.

If you're controlling all of this, then it's out of my hands as to what you end up with; you have to know what you're asking for. But it would give you the control to have it be whatever you want.

For your simple case above, with everything being imported as a string, all you would need to do is add the option that says don't override the existing SAS format. Then your date variable would come over as 22.09.2019 instead of 2019-09-22, and since you told Pandas to import it as a string, it would stay that way. But if you told Pandas it was still supposed to be a datetime, then it (probably) wouldn't import correctly; I'm not sure about every single format, but I know many don't work, which is why I use the E8601xxx formats.

Likewise, there is no end of numeric formats that won't import as numeric, since they have character data embedded, like $120.99 or 1,234,567 or exponential notation. Those may be what you want to see as string representations of numbers, but Pandas won't create a numeric type from them when they're written that way.

I can code up an option to not override the SAS formats and let you try it out. Do you have more complicated use cases, or is 'import everything as a string' your go-to case? That should work without fail, but you can shoot yourself in the foot pretty easily if you override the data transfer format and the import type specifications. It does, though, give you pretty much full control over everything I'm doing for you. What do you think?
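As a pandas-only sketch of that 'everything as a string' route (using the 22.09.2019 example from above; the parse-back step is something you'd do yourself after the transfer):

```python
import pandas as pd

# Columns arrived as strings, still showing their original SAS formatting
df = pd.DataFrame({'names': ['testd'], 'date': ['22.09.2019']})

# Parse the dd.mm.yyyy string back into a real datetime64 column when needed
df['date_parsed'] = pd.to_datetime(df['date'], format='%d.%m.%Y')
```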

Thanks, Tom

tomweber-sas commented 4 years ago

Ok, I have this coded up and it's doing what I think you want. I'm not convinced 'my_fmts' will be the final option name, but for now, that's the key. Here's some output using almost the same data as before, but with some formats specified that show some numeric formatting too. BTW, if you specify my_fmts= without dtype=, I ignore it, as that won't work right. You have to define the output types if you're going to change the transfer format, and you have to get it right for the dataframe to be correct. With power comes responsibility :)

>>> rows = [[datetime.datetime(1965, 1, 1, 8, 0, 1), 10000.0, 1.30, 'a'],
...         [datetime.datetime(1966, 1, 1, 7, 0, 2), 20000.0, 2.30, 'b'],
...         [datetime.datetime(1967, 1, 1, 6, 0, 3), 30000.0, 3.30, ' '],
...         [datetime.datetime(1968, 1, 1, 5, 0, 4), 40000.0, 4.30,  ''],
...         [None,                                       5.0, 5.0,  'b'],
...         [None,                                       6.0, None, 'b'],
...        ]
>>>
>>> df = pd.DataFrame.from_records(rows, columns=['dt','n1','n2', 's1'])
>>> df
                   dt       n1   n2 s1
0 1965-01-01 08:00:01  10000.0  1.3  a
1 1966-01-01 07:00:02  20000.0  2.3  b
2 1967-01-01 06:00:03  30000.0  3.3
3 1968-01-01 05:00:04  40000.0  4.3
4                 NaT      5.0  5.0  b
5                 NaT      6.0  NaN  b
>>> df.dtypes
dt    datetime64[ns]
n1           float64
n2           float64
s1            object
dtype: object
>>>
...
>>>
>>> sd = sas.df2sd(df, datetimes={'dt':'date'}, outfmts={'dt':'mmddyy10.', 'n1': 'comma32.2', 'n2': 'dollar32.2'}, results='text')
>>> sd.columnInfo()

                                                           The SAS System                         12:16 Monday, January 27, 2020   3

                                                       The CONTENTS Procedure

                                             Alphabetic List of Variables and Attributes

                                             #    Variable    Type    Len    Format

                                             1    dt          Num       8    MMDDYY10.
                                             2    n1          Num       8    COMMA32.2
                                             3    n2          Num       8    DOLLAR32.2
                                             4    s1          Char      1

>>>
>>> sd.head()

                                                           The SAS System                         12:16 Monday, January 27, 2020   2

                  Obs            dt                        n1                        n2    s1

                   1     01/01/1965                 10,000.00                     $1.30    a
                   2     01/01/1966                 20,000.00                     $2.30    b
                   3     01/01/1967                 30,000.00                     $3.30
                   4     01/01/1968                 40,000.00                     $4.30
                   5              .                      5.00                     $5.00    b

>>>
>>> df2 = sd.to_df_DISK(dtype='str')
>>> df2
           dt     n1   n2   s1
0  1965-01-01  10000  1.3    a
1  1966-01-01  20000  2.3    b
2  1967-01-01  30000  3.3  NaN
3  1968-01-01  40000  4.3  NaN
4         NaN      5    5    b
5         NaN      6  NaN    b
>>> df2.dtypes
dt    object
n1    object
n2    object
s1    object
dtype: object
>>>
>>> df2 = sd.to_df_DISK(my_fmts=True)
my_fmts option only valid when dtype= is specified. Ignoring and using necessary formatting for data transfer.
>>> df2
          dt       n1   n2   s1
0 1965-01-01  10000.0  1.3    a
1 1966-01-01  20000.0  2.3    b
2 1967-01-01  30000.0  3.3  NaN
3 1968-01-01  40000.0  4.3  NaN
4        NaT      5.0  5.0    b
5        NaT      6.0  NaN    b
>>> df2.dtypes
dt    datetime64[ns]
n1           float64
n2           float64
s1            object
dtype: object
>>>
>>> df2 = sd.to_df_DISK(dtype='str', my_fmts=True)
>>> df2
           dt         n1     n2   s1
0  01/01/1965  10,000.00  $1.30    a
1  01/01/1966  20,000.00  $2.30    b
2  01/01/1967  30,000.00  $3.30  NaN
3  01/01/1968  40,000.00  $4.30  NaN
4         NaN       5.00  $5.00    b
5         NaN       6.00    NaN    b
>>> df2.dtypes
dt    object
n1    object
n2    object
s1    object
dtype: object
>>>

So, that last one applies the actual SAS formats to the data and transfers it over that way; then, when you create the columns as 'str', you get the same formatted values you would see in SAS output. Note also that both of these options (dtype= and my_fmts=) apply to all columns/variables. So it's your responsibility to have all the formats and all the dtypes compatible for the whole process to work correctly.

Does this behave more like what you want, and are you good with this level of control, and responsibility?

If you don't get the two sides consistent, you will end up with errors like the following; with this option, that's on you. Here I specify a dtype of float for one of the numerics that I formatted with embedded characters ($ and commas). This is the expected error for that situation.

>>> df2 = sd.to_df_DISK(dtype={'n1':'float'}, my_fmts=True)
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1191, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array from dtype('O') to dtype('float64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/tom/github/saspy/saspy/sasdata.py", line 1148, in to_df_DISK
    return self.to_df(method='DISK', tempfile=tempfile, tempkeep=tempkeep, **kwargs)
  File "/opt/tom/github/saspy/saspy/sasdata.py", line 1102, in to_df
    return self.sas.sasdata2dataframe(self.table, self.libref, self.dsopts, method, **kwargs)
  File "/opt/tom/github/saspy/saspy/sasbase.py", line 1337, in sasdata2dataframe
    return self._io.sasdata2dataframe(table, libref, dsopts, method=method, **kwargs)
  File "/opt/tom/github/saspy/saspy/sasioiom.py", line 1551, in sasdata2dataframe
    return self.sasdata2dataframeDISK(table, libref, dsopts, **kwargs)
  File "/opt/tom/github/saspy/saspy/sasioiom.py", line 2271, in sasdata2dataframeDISK
    encoding=enc, **kwargs)
  File "/usr/lib64/python3.5/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib64/python3.5/site-packages/pandas/io/parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "/usr/lib64/python3.5/site-packages/pandas/io/parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "/usr/lib64/python3.5/site-packages/pandas/io/parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1197, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: could not convert string to float: '40,000.00'
>>>
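If you do want numeric dtypes from values like those, one workaround (a sketch, done on the pandas side after bringing the columns over as strings) is to strip the embedded characters before casting:

```python
import pandas as pd

# Values transferred under COMMA32.2 / DOLLAR32.2 formats, read in as strings
df = pd.DataFrame({'n1': ['10,000.00', '40,000.00'],
                   'n2': ['$1.30', '$4.30']})

# Remove the embedded characters, then cast to float
df['n1'] = df['n1'].str.replace(',', '', regex=False).astype(float)
df['n2'] = df['n2'].str.lstrip('$').astype(float)
```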

Thoughts? @biojerm Have you been following this part? Do you have any input on this enhancement? Is this something you might use too? Would it be the way you'd expect to use it?

Thanks! Tom

tomweber-sas commented 4 years ago

I've pushed this to master so you can try it out and see what you think. Be aware, I'm still trying to think up a better option name than 'my_fmts', so that may change, but that's what it is for now, as in my example. So give it a try and see what you think!

Thanks, Tom

xenonspace commented 4 years ago

Hi Tom,

It worked perfectly.

Thanks!

biojerm commented 4 years ago

These enhancements look good to me. I could see my use case also wanting to replicate what I see in the SAS data as strings within the pandas dataframe. So the combination of dtype= and my_fmts= would solve my needs. Nice enhancement. I am good to close this issue. @xenonspace, you good?

tomweber-sas commented 4 years ago

That's good to hear! Yes, it seemed you might benefit from that w/ the string types too. And thanks to both of you for helping me work through this. As I mention in my upcoming SGF paper, saspy is 'open source' not only in that users can contribute, but also in that, even if I write the code, much of what I've done is shaped by user requests. So, even if they didn't write parts of it, much of the functionality has still been determined by them. This is certainly another case in point!

Thanks, Tom

BTW, I'm working on documenting all of this, and more testing, but I will create a new release, V3.2.0, once I complete that so this will all be production.

tomweber-sas commented 4 years ago

Hey, I've built a new release with all of this in it; V3.2.0. It's on here and on pypi and will be building on conda soon. So, I'll go ahead and close this. Just let me know if there's anything else!

BTW, some doc on this here: https://sassoftware.github.io/saspy/advanced-topics.html#dates-times-and-datetimes-oh-my and here: https://sassoftware.github.io/saspy/advanced-topics.html#advanced-sd2df-and-df2sd-techniques

Thanks again, Tom