Saspy 3.6 returns None when empty dataframe is converted to Sasdataset

biojerm commented 3 years ago

Describe the bug With saspy 3.6 if an empty data frame is passed to the df2sd method None is returned instead of a dataset with no observations. In sas 3.5.3(The last version I had installed) a sas dataset with no observations was returned in the same scenario as above. In my use case, I am turning CSVs into sas7bdat files .csv>pd.DataFrame>.sas7bdat and some of the csv files have headers by no data rows. So I would think the 'correct' sas dataset would be an dataset with zero ds.obs().

To Reproduce

import saspy
import pandas as pd
import logging

logging.basicConfig()
df = pd.DataFrame({"col1": ["abc", "≥", "def"]})

sas = saspy.SASsession(cfgname="sas_u8")

print(sas)
sas_ds = sas.df2sd(df, "should_be_ok")

print(type(sas_ds))
print("###########################################\n\n")

print("The issue I am seeing is when the DataFrame is empty like below")

empty_df = pd.DataFrame(columns=["col1"])
empty_sas_ds = sas.df2sd(empty_df, "returns_none_not_empty_ds")

print(type(empty_sas_ds))

print("###########################################\n\n")
print(sas.lastlog())

Expected behavior If a data frame with no rows is passed to df2ds I would expect a dataset to be created with the same columns and no headers

Screenshots

Access Method         = STDIO
SAS Config name       = sas_u8
SAS Config file       = /scratch/jlabarge/saspy_sandbox/sascfg_personal.py
WORK Path             = /tmp/SAS_work7065000076C7_statsrv/
SAS Version           = 9.04.01M2P07232014
SASPy Version         = 3.5.3
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = utf-8
Python Encoding value = utf_8
SAS process Pid value = 30407

<class 'saspy.sasdata.SASdata'>
###########################################

The issue I am seeing is when the DataFrame is empty like below
<class 'saspy.sasdata.SASdata'>
###########################################

53   data 'returns_none_not_empty_ds'n;
54   length 'col1'n $nan;
54   length 'col1'n $nan;
                     ---
                     391
ERROR 391-185: Expecting a variable length specification.

54   length 'col1'n $nan;
                     ---
                     202
ERROR 202-322: The option or parameter is not recognized and will be ignored.

55   infile datalines delimiter='03'x  STOPOVER;
56   input @;
57   if _infile_ = '' then delete;
58   input 'col1'n ;
59    'col1'n = translate('col1'n, '0A'x, '01'x);
60    'col1'n = translate('col1'n, '0D'x, '02'x );
61   ;
62   datalines4;
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.RETURNS_NONE_NOT_EMPTY_DS may be incomplete.  When this step was stopped there were 0 observations and 1
         variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

63   ;;;;
64   ;*';*";*/;
65   run;
66
67   ;*';*";*/;%put %upcase(e3969440a681a2408885998500000007);
E3969440A681A2408885998500000007
68   ;*';*";*/;
69   data _null_; e = exist("'returns_none_not_empty_ds'n");
70   v = exist("'returns_none_not_empty_ds'n", 'VIEW');
71    if e or v then e = 1;
72   put 'TABLE_EXISTS=' e 'TAB_EXTEND=';run;
TABLE_EXISTS=1 TAB_EXTEND=
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

73
74   ;*';*";*/;%put %upcase(e3969440a681a2408885998500000008);
E3969440A681A2408885998500000008
75   ;*';*";*/;
76   data _null_; e = exist("user.'returns_none_not_empty_ds'n");
77   v = exist("user.'returns_none_not_empty_ds'n", 'VIEW');
78    if e or v then e = 1;
79   put 'TABLE_EXISTS=' e 'TAB_EXTEND=';run;
TABLE_EXISTS=0 TAB_EXTEND=
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

80
81   ;*';*";*/;%put %upcase(e3969440a681a2408885998500000009);
E3969440A681A2408885998500000009

tomweber-sas commented 3 years ago

Hey Jeremy, good catch. This is a regression due to the performance enhancement to the STDIO access method as part of the rework for the char column lengths enhancements in 3.6.0. Actually, it exposes a bug in the code gen, which just happens to not cause a failure (even though there the error), and you still get an empty dataset. This still happens to work with the other access methods, but I have a fix to make this correct for all. I had a check for the col len being 0 from the calculation, and that would then set it to 8; the default length for an empty char column in SAS. But, I'm not getting 0, I'm getting nan, and that's where that syntax error is coming from. That syntax error doesn't stop the datastep and fail, but rather it succeeds with an empty table. This is still happening in the other access methods, but with the fix in STDIO, it's now causing an error that didn't used to happen. I've just changed my check for not getting a length to account for nan, or 0 ... and then set it to the default 8 and the datastep now runs correctly with the correct signature, so you get the empty dataset with correctly defined, default, char col lengths.

I've pushed this to main, can you give it a try and see that it works correctly for you now too?

Thanks! Tom

biojerm commented 3 years ago

Hey Tom,

The update makes it so that can write empty datasets and a saspy.dataset is returned, however, when I try to open an empty dataset that is written to disk (not WORK) I am getting this message. If there is data, I don't have an issue opening the file.

(I am using the SASUniveralViewer)

at the risk of being redundant here is the code I am running, i added some write out steps in the end:

import saspy
import pandas as pd
import logging

logging.basicConfig()
df = pd.DataFrame({"col1": ["abc", "≥", "def"]})

output_path = "/scratch/jlabarge"

sas = saspy.SASsession(cfgname="sas_u8")

print(sas)

sas_ds = sas.df2sd(df, "should_be_ok")
print("sas_ds type: ", type(sas_ds))
print("###########################################\n\n")

# print("The issue I am seeing is when the DataFrame is empty like below")
empty_df = pd.DataFrame(columns=["col1"])
empty_sas_ds = sas.df2sd(empty_df, "returns_none_not_empty_ds")

print("empty_sas_ds type: ", type(empty_sas_ds))

print("###########################################\n\n")
# print(sas.lastlog())

print("writing datasets to disk")
sas.saslib("empty", path=output_path)
written_empty_ds = sas.df2sd(empty_df, libref="empty", table="i_am_empty")

print("written_empty_ds type: ", type(written_empty_ds))

sas.saslib("empty", path=output_path)
written_empty_ds = sas.df2sd(df, libref="empty", table="i_am_have_data")
# print(sas.saslog())

biojerm commented 3 years ago

One more random thing I ran into was that the libref cannot have underscores. Has that always been true and I never noticed?

tomweber-sas commented 3 years ago

Hey Jeremy, I've tried out that code, and it seems to work; as far as sas is concerned (saspy and sas itself). I don't have the universal viewer to test out; that works on these datasets before?

>>> sas.saslib("empty", path=output_path)

280
281  libname empty    '/tmp/x'  ;
NOTE: Library EMPTY does not exist.
282
>>> sas.saslib("empty", path=output_path)

284
285  libname empty    '/tmp/x'  ;
NOTE: Libref EMPTY was successfully assigned as follows:
      Engine:        V9
      Physical Name: /tmp/x
286
>>> written_empty_ds = sas.df2sd(empty_df, libref="empty", table="i_am_empty")
>>>
>>> written_empty_ds
Libref  = empty
Table   = i_am_empty
Dsopts  = {}
Results = Pandas

>>> written_empty_ds.head()
Empty DataFrame
Columns: [col1]
Index: []
>>>
>>> written_empty_ds.columnInfo()
             Member  Num Variable  Type  Len  Pos
0  EMPTY.I_AM_EMPTY    1     col1  Char    8    0
>>>

sas
[...]

NOTE: SAS initialization used:
      real time           1.07 seconds
      cpu time            0.17 seconds

  1? libname empty    '/tmp/x'  ;

NOTE: Libref EMPTY was successfully assigned as follows:
      Engine:        V9
      Physical Name: /tmp/x

  2? proc print data=empty.i_am_empty; run;

NOTE: No observations in data set EMPTY.I_AM_EMPTY.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.69 seconds
      cpu time            0.34 seconds

  3?

and underscores are valid in libref:

>>> sas.saslib("em_pty", path=output_path)

680
681  libname em_pty    '/tmp/x'  ;
NOTE: Libref EM_PTY refers to the same physical library as EMPTY.
NOTE: Libref EM_PTY was successfully assigned as follows:
      Engine:        V9
      Physical Name: /tmp/x
682
>>> sas.saslib("_empty", path=output_path)

684
685  libname _empty    '/tmp/x'  ;
NOTE: Libref _EMPTY refers to the same physical library as EM_PTY.
NOTE: Libref _EMPTY was successfully assigned as follows:
      Engine:        V9
      Physical Name: /tmp/x
686
>>>

biojerm commented 3 years ago

interesting.

I will keep digging with the the unable to open problem.

With the underscores I am getting this:

7
88   libname who_is_empty    '/scratch/jlabarge'  ;
ERROR: who_is_empty is not a valid SAS name.
ERROR: Error in the LIBNAME statement.
89
The libref specified is not assigned in this SAS Session.

might be something with our sas install/config. I know we have other programs that have funny behaviors with underscores.

tomweber-sas commented 3 years ago

Oh, that's just because it's too long. only 8 chars for a libref. But, is the fix working ok for you? That error from the universal viewer seem to say it can't support a dataset with no rows. But the rest of saspy is working as expected with this empty data set?

biojerm commented 3 years ago

Ahhh, did not know there was an 8 char limit on librefs.

Yeah the rest seems to be working as expected on the saspy/sas front. I can also confirm that the same BOF/EOF message pop us with the previous version of saspy. So it looks more like a limit of the UniversalViewer

I am ok to close the issue.

tomweber-sas commented 3 years ago

Cool man! Yes it all looks good to me. I have 2 other small fixes at main now too. I think I'll build a new version with these 3 fixes. Then you've got a clean prod version to install and use. I'll close this, and, I just built a new version : 3.6.1 Again, conda will take a day'ish to be built, but saspy repo and pypi have 3.6.1 now.

Thanks, Tom

sassoftware / saspy

Saspy 3.6 returns None when empty dataframe is converted to Sasdataset #335