sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy

Issue writing large pandas dataframe to SAS dataset #276

Closed: biojerm closed this issue 4 years ago

biojerm commented 4 years ago

Hi, I am having trouble passing a large pandas DataFrame to the SAS kernel using the df2sd method. For testing purposes, I am trying to read a SAS dataset into a pandas DataFrame and then write it back out as a SAS dataset (for reference, the SAS dataset is about ~560 MB). I have a toy script that can write small pandas DataFrames to SAS datasets. However, when I tried to create a SAS dataset from a large pandas DataFrame, I noticed the df2sd step seems to hang. For some reason 91 rows seems to be the limit. When only 91 rows of data are sent to the SAS kernel, it is pretty quick. However, when more than 91 rows are sent to the kernel, the program hangs (more than 1 hr before I got impatient). Below is the output from the ipython session I was running.

In [64]: sas_df = pd.read_sas('l0NNXXX.sas7bdat')
In [65]: sas_df.shape
Out[65]: (133361, 118)

In [66]: sas = saspy.SASsession(cfgfile='~/sascfg_personal.py', cfgname='mycfg')
SAS Connection established. Subprocess id is 91265

No encoding value provided. Will try to determine the correct encoding.
Setting encoding to latin_1 based upon the SAS session encoding value of latin1.

In [67]: sas_ds = sas.df2sd(sas_df, 'test_output2')
^C---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-67-f5572529564f> in <module>
----> 1 sas_ds = sas.df2sd(sas_df, 'test_output2')

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in df2sd(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1011         :return: SASdata object
   1012         """
-> 1013         return self.dataframe2sasdata(df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1014
   1015     def dataframe2sasdata(self, df: 'pandas.DataFrame', table: str = '_df', libref: str = '',

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in dataframe2sasdata(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1045             return None
   1046         else:
-> 1047             self._io.dataframe2sasdata(df, table, libref, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1048
   1049         if self.exist(table, libref):

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in dataframe2sasdata(self, df, table, libref, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1450             if col < (ncols-1):
   1451                card += colsep
-> 1452          self.stdin.write(card.encode(self.sascfg.encoding)+b'\n')
   1453          #self._asubmit(card, "text")
   1454

KeyboardInterrupt:

In [68]: sas = saspy.SASsession(cfgfile='~/sascfg_personal.py', cfgname='mycfg')
SAS Connection established. Subprocess id is 92266

No encoding value provided. Will try to determine the correct encoding.
Setting encoding to latin_1 based upon the SAS session encoding value of latin1.

In [69]: sas_ds = sas.df2sd(sas_df[:91], 'test_output2')
In [70]:

here is my sascfg_personal.py

SAS_config_names = ['mycfg']
SAS_config_options = {'lock_down': False,
                      'verbose' : True
                      }

mycfg = {'saspath': '/usr/local/apps/bin/sas'}

and my saspy version:

In [71]: saspy.__version__
Out[71]: '3.1.7'

Thanks in advance!

tomweber-sas commented 4 years ago

Hey, I can help with this. My initial thought is that it's not about how many rows, but rather the contents of the data. I could use a little more info to start with. Can you submit your SASsession object (sas) so I can see that output? It will look something like this:

>>> sas
Access Method         = STDIO
SAS Config name       = sdssas
SAS Config file       = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path             = /sastmp/SAS_workF33A000052DC_tom64-5/
SAS Version           = 9.04.01M6D11072018
SASPy Version         = 3.1.9
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = latin9
Python Encoding value = iso8859_15
SAS process Pid value = 21212

And also submit sas_df.dtypes, so I can see what types the data frame has. When you say you can send 91 rows but not more, what have you tried? Could it be something about the contents of row 92 that is causing a problem? Or can you send more than 91 rows from a different range of the data? df2sd generates code to send delimited rows of data to SAS and uses a low hex value character as the delimiter, x'03', as this isn't a value that would appear in character data. Does this data have binary data in character columns? The other possible data problem is character columns with newlines in them. That can cause issues too, and there's an option on df2sd to address that: embedded_newlines=True. You could try that to see if that is the problem.
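A rough way to check the character columns for either of those from the pandas side (a sketch, assuming sas_df is the frame from the session above; x'03' is the delimiter described in this comment):

def to_text(v):
    # pd.read_sas often returns bytes in object columns; decode them for inspection
    return v.decode('latin_1', errors='replace') if isinstance(v, bytes) else str(v)

suspects = {}
for col in sas_df.select_dtypes(include='object').columns:
    vals = sas_df[col].dropna().map(to_text)
    hits = vals.str.contains('\x03|\r|\n', regex=True)
    if hits.any():
        suspects[col] = int(hits.sum())

print(suspects)  # columns containing the x'03' delimiter or embedded newlines, with counts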

The next thing I would try is using SAS to read that dataset. Is there any issue with that? Can you use sd2df to read it and create the data frame? If so, does df2sd then work on that dataframe?

Thanks! Tom

biojerm commented 4 years ago

Thanks for getting back to me! I did some more investigation:

1) Had a colleague open the dataset in an interactive SAS session and there were no issues. So I don't think the .sas7bdat dataset is corrupted (so that is a good sign)

2) Here are the settings of my sas object

In [72]: sas = saspy.SASsession(cfgfile='~/sascfg_personal.py', cfgname='mycfg')
SAS Connection established. Subprocess id is 102939

No encoding value provided. Will try to determine the correct encoding.
Setting encoding to latin_1 based upon the SAS session encoding value of latin1.

In [73]: sas
Out[73]:
Access Method         = STDIO
SAS Config name       = mycfg
SAS Config file       = /home/jlabarge/sascfg_personal.py
WORK Path             = /tmp/SAS_workFB250001921C_statsrv/
SAS Version           = 9.04.01M2P07232014
SASPy Version         = 3.1.7
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = latin1
Python Encoding value = latin_1
SAS process Pid value = 102940

3) here are the datatypes of the dataframe built using pd.read_sas

        'protocol', 'site', 'visit', 'spctm', 'spcode', 'filetime',
        'assay_error', 'new_record', 'result_excluded', 'any_errors',
        'id_statusassayresqc', 'pending', 'missing_sas_code',
        'id_statusassayresqc_old', 'set_failure', 'spid_bl', 'procldmsid',
        'triplet', 'community_no', 'ptid', 'alqcntoth', 'alqvoloth',
        'alqlaboth', 'alqcnt1', 'alqcnt2', 'alqcnt3', 'alqcnt4', 'alqcnt5',
        'alqvol1', 'alqvol2', 'alqvol3', 'alqvol4', 'alqvol5', 'alqlab1',
        'alqlab2', 'alqlab3', 'alqlab4', 'alqlab5', 'visit_a', 'spcol',
        'requested', 'exp_status', 'from_nt_ds', 'visit_e', 'assayexp_exclude',
        'problem_code', 'resultstat_ch'],
       dtype='object'),
 dtype('<M8[ns]'): Index(['filedate', 'reqdate', 'requestdt', 'time_status', 'procdt', 'spcdt_a',
        'spcdt_e'],
       dtype='object'),
 dtype('O'): Index(['conid', 'guspec', 'ASSAYTYPE', 'rluc', 's_coc', 'cp', 'pid', 'sid',
        'name', 'gender', 'dobc', 'assay', 'result', 'dilution', 'code',
        'module', 'caldtc', 'operator', 'releasedby', 'reagentlot', 'reagent',
        'drawdtc', 'compldtc', 'location', 'file_name_orig', 'file_name_labkey',
        'network', 'labid', 'specid', 'spectm', 'assayspec', 'specnm',
        'specode', 'csvfile', 'subtype', 'reqdatec', 'comment', 's_comment',
        'ref', 'requestby', 'requestfmt', 'excludefrom', 'file_name_merge',
        'file_ext_orig', 'status_source', 'sas_version', 'file_input',
        'file_name', 'file_dir', 'file_full', 'primstr', 'addstr', 'dervst2',
        'cohort2', 'volstr', 'pktime_a', 'pktime_e', 'problem_code_note'],
       dtype='object')}

Looks like my datatypes are: object('O'), float64, and datetime64[ns]

4) I tried a different range of rows, excluding the previously problematic rows, and saw that 90 rows load fine but 91 do not

In [85]: sas = saspy.SASsession(cfgfile='~/sascfg_personal.py', cfgname='mycfg')
SAS Connection established. Subprocess id is 114524

No encoding value provided. Will try to determine the correct encoding.
Setting encoding to latin_1 based upon the SAS session encoding value of latin1.

In [86]: sas_ds = sas.df2sd(sas_df[93:93+90], 'test_output2')    # <--- load in worked quickly

In [87]: sas_ds = sas.df2sd(sas_df[93:93+91], 'test_output2')    # <--- load failed
^C---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-87-88933f55643d> in <module>
----> 1 sas_ds = sas.df2sd(sas_df[93:93+91], 'test_output2')

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in df2sd(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1011         :return: SASdata object
   1012         """
-> 1013         return self.dataframe2sasdata(df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1014
   1015     def dataframe2sasdata(self, df: 'pandas.DataFrame', table: str = '_df', libref: str = '',

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in dataframe2sasdata(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1045             return None
   1046         else:
-> 1047             self._io.dataframe2sasdata(df, table, libref, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1048
   1049         if self.exist(table, libref):

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in dataframe2sasdata(self, df, table, libref, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1453          #self._asubmit(card, "text")
   1454
-> 1455       self._asubmit(";;;;", "text")
   1456       ll = self.submit("run;", 'text')
   1457       return

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in _asubmit(self, code, results)
    529          self.stdin.write(odsclose)
    530
--> 531       self.stdin.flush()
    532
    533       return str(out)

KeyboardInterrupt:

In [88]:

5) Tried a smaller subset of data that encompassed the rows that were not loading previously. This worked, so I don't think there is an issue with any specific row of data.

SAS Connection established. Subprocess id is 127914

No encoding value provided. Will try to determine the correct encoding.
Setting encoding to latin_1 based upon the SAS session encoding value of latin1.

In [105]: sas_ds = sas.df2sd(sas_df[80:100], 'test_output2')

In [106]:

6) Tried the embedded_newlines option; it also hung.

SAS Connection established. Subprocess id is 117641

No encoding value provided. Will try to determine the correct encoding.
Setting encoding to latin_1 based upon the SAS session encoding value of latin1.

In [91]: sas_ds = sas.df2sd(sas_df[93:93+91], 'test_output2', embedded_newlines=True)
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-91-06c4ff4de519> in <module>
----> 1 sas_ds = sas.df2sd(sas_df[93:93+91], 'test_output2', embedded_newlines=True)

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in df2sd(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1011         :return: SASdata object
   1012         """
-> 1013         return self.dataframe2sasdata(df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1014
   1015     def dataframe2sasdata(self, df: 'pandas.DataFrame', table: str = '_df', libref: str = '',

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in dataframe2sasdata(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1045             return None
   1046         else:
-> 1047             self._io.dataframe2sasdata(df, table, libref, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1048
   1049         if self.exist(table, libref):

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in dataframe2sasdata(self, df, table, libref, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1453          #self._asubmit(card, "text")
   1454
-> 1455       self._asubmit(";;;;", "text")
   1456       ll = self.submit("run;", 'text')
   1457       return

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in _asubmit(self, code, results)
    529          self.stdin.write(odsclose)
    530
--> 531       self.stdin.flush()
    532
    533       return str(out)

7) Tried the saspy sd2df method, and this did not create a df. I only gave it 5-10 min though, so if I need to wait longer please let me know:

In [102]: all_sas_df = sas.sd2df('NNNNXXX.sas7bdat')
^C---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-102-fbd6dbb693d4> in <module>
----> 1 all_sas_df = sas.sd2df('NNNNXXX.sas7bdat')

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in sd2df(self, table, libref, dsopts, method, **kwargs)
   1094         """
   1095         dsopts = dsopts if dsopts is not None else {}
-> 1096         return self.sasdata2dataframe(table, libref, dsopts, method, **kwargs)
   1097
   1098     def sd2df_CSV(self, table: str, libref: str = '', dsopts: dict = None, tempfile: str = None,

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in sasdata2dataframe(self, table, libref, dsopts, method, **kwargs)
   1236
   1237         dsopts = dsopts if dsopts is not None else {}
-> 1238         if self.exist(table, libref) == 0:
   1239             print('The SAS Data Set ' + libref + '.' + table + ' does not exist')
   1240             return None

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in exist(self, table, libref)
    704         :rtype: bool
    705         """
--> 706         return self._io.exist(table, libref)
    707
    708     def sasets(self) -> 'SASets':

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in exist(self, table, libref)
    915       code += "put 'TABLE_EXISTS=' e 'TAB_EXTEND=';run;"
    916
--> 917       ll = self.submit(code, "text")
    918
    919       exists = int(ll['LOG'].rpartition("TABLE_EXISTS=")[2].rpartition(" TAB_EXTEND=")[0])

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in submit(self, code, results, prompt)
    632
    633       out = self.stdin.write(b'\n'+logcodei.encode(self.sascfg.encoding)+b'\n')
--> 634       self.stdin.flush()
    635
    636       while not done:

KeyboardInterrupt:

8) I also have access to SAS_UTF8 in house, so I updated my config and tried using that with no success.

In [110]: sas
Out[110]:
Access Method         = STDIO
SAS Config name       = mycfg
SAS Config file       = /home/jlabarge/sascfg_personal.py
WORK Path             = /tmp/SAS_work31FC00003A38_statsrv/
SAS Version           = 9.04.01M2P07232014
SASPy Version         = 3.1.7
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = utf-8
Python Encoding value = utf_8
SAS process Pid value = 14904

In [111]: sas_df.shape
Out[111]: (133361, 118)

In [112]: sas_ds = sas.df2sd(sas_df[1:100], 'test_output2')
^C---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-112-b804163a763f> in <module>
----> 1 sas_ds = sas.df2sd(sas_df[1:100], 'test_output2')

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in df2sd(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1011         :return: SASdata object
   1012         """
-> 1013         return self.dataframe2sasdata(df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1014
   1015     def dataframe2sasdata(self, df: 'pandas.DataFrame', table: str = '_df', libref: str = '',

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in dataframe2sasdata(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1045             return None
   1046         else:
-> 1047             self._io.dataframe2sasdata(df, table, libref, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1048
   1049         if self.exist(table, libref):

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in dataframe2sasdata(self, df, table, libref, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
   1423          code += "format "+format+";\n"
   1424       code += "infile datalines delimiter="+delim+" DSD STOPOVER;\n input "+input+";\n"+xlate+";\n datalines4;"
-> 1425       self._asubmit(code, "text")
   1426
   1427       for row in df.itertuples(index=False):

/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in _asubmit(self, code, results)
    524          self.stdin.write(odsopen)
    525
--> 526       out = self.stdin.write(code.encode(self.sascfg.encoding)+b'\n')
    527
    528       if (ods):

KeyboardInterrupt:

In [113]:

Thanks again for helping. Please let me know if you have any ideas or other troubleshooting steps I can try.

Best, Jeremy

tomweber-sas commented 4 years ago

Good job trying different things out!

For the sd2df case, I should have provided a code example for that. It doesn't work to do all_sas_df = sas.sd2df('NNNNXXX.sas7bdat').

You would need to assign the library where that data set is, in the SAS session, and then use sd2df with the table name and the libref.

sas.saslib('mylib', path='/wherever/that/dataset/lives')
# lets see what SAS thinks this is
sd = sas.sasdata('NNNNXXX', 'mylib', results='html')
sd.contents()
# try sd2df
df = sas.sd2df('NNNNXXXX', 'mylib')  # or, as we have the SASdata object, just -> df = sd.to_df()
# if it ran, look at the dataframe
df.head()
df.dtypes

That would run the sd2df case.

The other thing to try: run the case that works (fewer than 92 rows) and then submit print(sas.saslog()) so we can see the code generated for df2sd; maybe there's something there I will spot.

I will dig into this more tomorrow too.

And I'm not sure I understand the dtypes output from that DF the pandas method created. When I run dtypes, it looks more like this:

>>> df = sas.sd2df('cars','sashelp')
>>> df.dtypes
Make            object
Model           object
Type            object
Origin          object
DriveTrain      object
MSRP             int64
Invoice          int64
EngineSize     float64
Cylinders      float64
Horsepower       int64
MPG_City         int64
MPG_Highway      int64
Weight           int64
Wheelbase        int64
Length           int64
dtype: object
>>>

That shows each column's data type. I'm more curious now whether reading the dataset with SAS and moving it back and forth works (sd2df, then df2sd), and whether there's something different about the dataframe the pandas reader creates that somehow doesn't work with what I'm doing.
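For reference, a minimal round-trip check along those lines might look like this (a sketch; 'NNNNXXX' and 'mylib' are the placeholders from above, and 'roundtrip' is an arbitrary table name):

df = sas.sd2df('NNNNXXX', 'mylib')   # SAS data set -> pandas DataFrame via saspy
sd = sas.df2sd(df, 'roundtrip')      # pandas DataFrame -> SAS data set (defaults per the df2sd signature above)
print(df.shape)                      # row/column counts to compare against the original
sd.contents()                        # and what SAS sees for the round-tripped table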

Any chance of a web-ex to do some quicker exploration? I can set that up and maybe get at this quicker if that's ok on your side.

Thanks! Tom

biojerm commented 4 years ago

Sorry about the datatypes; I tried to get clever and do a groupby. I have too many columns for the full output, but here is what ipython shows. Overall the datatypes are object, float64, and datetime64[ns].


compldtc                     object
location                     object
id_file                     float64
id_replace_file             float64
                          ...
volstr                       object
alqcnt1                     float64
alqcnt2                     float64
alqcnt3                     float64
alqcnt4                     float64
alqcnt5                     float64
alqvol1                     float64
alqvol2                     float64
alqvol3                     float64
alqvol4                     float64
alqvol5                     float64
alqlab1                     float64
alqlab2                     float64
alqlab3                     float64
alqlab4                     float64
alqlab5                     float64
visit_a                     float64
spcdt_a              datetime64[ns]
pktime_a                     object
spcol                       float64
requested                   float64
exp_status                  float64
from_nt_ds                  float64
visit_e                     float64
spcdt_e              datetime64[ns]
pktime_e                     object
assayexp_exclude            float64
problem_code                float64
problem_code_note            object
resultstat_ch               float64
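
For reference, grouping the columns by dtype along these lines produces the dict-of-Index output shown in the earlier comment (a sketch, not the exact line used):

by_dtype = sas_df.columns.to_series().groupby(sas_df.dtypes).groups   # {dtype: Index of column names}
for dtype, cols in by_dtype.items():
    print(dtype, len(cols))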

I will try the sd2df again and post the results

biojerm commented 4 years ago

I was able to read the dataset into SAS using your suggestion above. I also added an option to ignore format errors. However, when I ran the sd.to_df() method it hung. Here is a snippet of the log with the ERROR(s):

NOTE: 119 records were written to the file STDERR.
      The minimum record length was 1.
      The maximum record length was 9.
NOTE: There were 0 observations read from the data set MYLIB.NNNXXX.
NOTE: There were 1 observations read from the data set WORK._N_U_L_L_.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.02 seconds

382  proc delete data=work._n_u_l_l_;run;
NOTE: Deleting WORK._N_U_L_L_ (memtype=DATA).
NOTE: PROCEDURE DELETE used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

383  
384  ;*';*";*/;
385  %put %upcase(e3969440a681a2408885998500000028);
E3969440A681A2408885998500000028
386  filename sock socket ':54793' lrecl=1048576 recfm=v termstr=LF;
NOTE: Data file MYLIB.NNNXXX.DATA is in a format that is native to another host, or the file encoding does not match the session 
      encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce 
      performance.
387   data _null_; set mylib.'l071hca'n ;
388   file sock dlm='02'x ; put 'conid'n 'guspec'n 'ldmsid'n best32. '02'x 'ASSAYTYPE'n 'rluc'n 's_coc'n 'recnum'n best32. '02'x
388! 'cp'n 'pid'n 'sid'n 'name'n 'gender'n 'dobc'n 'assay'n 'rlu'n best32. '02'x 's_co'n best32. '02'x 'result'n 'dilution'n 'code'n
388!  'module'n 'caldtc'n 'operator'n 'releasedby'n 'reagentlot'n 'reagent'n 'drawdtc'n 'compldtc'n 'location'n 'id_file'n best32.
388! dule'n 'caldtc'n 'operator'n 'releasedby'n 'reagentlot'n 'reagent'n 'drawdtc'n 'compldtc'n 'location'n 'id_file'n best32. '02'x
388!  'id_replace_file'n best32. '02'x 'file_name_orig'n 'file_name_labkey'n 'network'n 'protocol'n best32. '02'x 'labid'n 'specid'n
388!  'site'n best32. '02'x 'visit'n best32. '02'x 'spectm'n 'spctm'n E8601TM15.6 '02'x 'assayspec'n 'specnm'n 'specode'n 'spcode'n
388! best32. '02'x 'csvfile'n 'subtype'n 'filedate'n E8601DA10. '02'x 'filetime'n E8601TM15.6 '02'x 'reqdatec'n 'reqdate'n
388! E8601DA10. '02'x 'comment'n 'assay_error'n best32. '02'x 'new_record'n best32. '02'x 's_comment'n 'ref'n 'requestby'n
388! 'requestdt'n E8601DA10. '02'x 'requestfmt'n 'excludefrom'n 'result_excluded'n best32. '02'x 'any_errors'n best32. '02'x
388! 'file_name_merge'n 'file_ext_orig'n 'time_status'n E8601DT26.6 '02'x 'status_source'n 'id_statusassayresqc'n best32. '02'x
388! 'sas_version'n 'file_input'n 'file_name'n 'file_dir'n 'file_full'n 'pending'n best32. '02'x 'missing_sas_code'n best32. '02'x
388! 'id_statusassayresqc_old'n best32. '02'x 'set_failure'n best32. '02'x 'spid_bl'n best32. '02'x 'primstr'n 'addstr'n 'dervst2'n
388! 'procldmsid'n best32. '02'x 'procdt'n E8601DA10. '02'x 'triplet'n best32. '02'x 'community_no'n best32. '02'x 'cohort2'n
388! 'ptid'n best32. '02'x 'alqcntoth'n best32. '02'x 'alqvoloth'n best32. '02'x 'alqlaboth'n best32. '02'x 'volstr'n 'alqcnt1'n
388! best32. '02'x 'alqcnt2'n best32. '02'x 'alqcnt3'n best32. '02'x 'alqcnt4'n best32. '02'x 'alqcnt5'n best32. '02'x 'alqvol1'n
388! best32. '02'x 'alqvol2'n best32. '02'x 'alqvol3'n best32. '02'x 'alqvol4'n best32. '02'x 'alqvol5'n best32. '02'x 'alqlab1'n
388! best32. '02'x 'alqlab2'n best32. '02'x 'alqlab3'n best32. '02'x 'alqlab4'n best32. '02'x 'alqlab5'n best32. '02'x 'visit_a'n
388! best32. '02'x 'spcdt_a'n E8601DA10. '02'x 'pktime_a'n 'spcol'n best32. '02'x 'requested'n best32. '02'x 'exp_status'n best32.
388! '02'x 'from_nt_ds'n best32. '02'x 'visit_e'n best32. '02'x 'spcdt_e'n E8601DA10. '02'x 'pktime_e'n 'assayexp_exclude'n best32.
388! '02'x 'problem_code'n best32. '02'x 'problem_code_note'n 'resultstat_ch'n best32. '02'x '01'x; run;
NOTE: Format $ASSAY_LABID_NAME was not found or could not be loaded.
NOTE: Format SPNAMECRF was not found or could not be loaded.
NOTE: Format NOYESNA was not found or could not be loaded.
NOTE: Format ASSAYQC was not found or could not be loaded.
NOTE: Format RLSTAT was not found or could not be loaded.
NOTE: Format $ASSAY_LABID_NAME was not found or could not be loaded.
NOTE: The file SOCK is:
      Local Host Name=comp,
      Local Host IP addr=127.0.1.1,
      Peer Hostname Name=comp.pc.host.org,
      Peer IP addr=127.0.1.1,Peer Name=N/A,
      Peer Portno=54793,Lrecl=1048576,Recfm=Variable

ERROR: Broken pipe.
FATAL: Unrecoverable I/O error detected in the execution of the DATA step program.  Aborted during the EXECUTION phase.
NOTE: 104347 records were written to the file SOCK.
      The minimum record length was 1980.
      The maximum record length was 2184.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 104348 observations read from the data set MYLIB.NNNXXX.
NOTE: DATA statement used (Total process time):
      real time           1:12.43
      cpu time            3.76 seconds

389  
390  ;*';*";*/;
391  
392  
393  ;*';*";*/;
394  %put %upcase(e3969440a681a2408885998500000029);
E3969440A681A2408885998500000029  

Let me know if the full log would be useful. Also, as an FYI, I let the process run for a minute or two and then killed it with ^C.

Edits for privacy

tomweber-sas commented 4 years ago

Ah, that makes sense about the dtypes. So the sd2df looks like it was running. I'm guessing that the broken pipe was because you killed it. It was 104347 records out of 133361 and 72 seconds into it. Can you let that run and see if it finishes? Also, you can use the sd2df_DISK() version to improve performance for this step. I see that this data set isn't from the same system that you're running on. That could have something to do with this. Can you run the following so we can see more about this?

sas.saslib('mylib', path='/wherever/that/dataset/lives')
# lets see what SAS thinks this is
sd = sas.sasdata('NNNNXXX', 'mylib', results='text')
sd.contents()
# try sd2df or sd2df_DISK
df = sas.sd2df('NNNNXXXX', 'mylib')  # or, df = sd.to_df_DISK()  which should be faster
# if it ran, look at the dataframe
df.head()
df.dtypes

Also, can we see the log from a df2sd that worked? There might be something useful in it. Oh, and I'm not sure the _DISK version is in 3.1.7; I added that recently, so to use it you might want to grab the current version, 3.1.9. The _DISK version of sd2df is faster for larger data sets than the default one, just like the _CSV version is, but it handles special-case data better than CSV.

Thanks! Tom

biojerm commented 4 years ago

Could you clarify your concern about the different file systems? I am SSH-ing into a Unix system that hosts our SAS license, and all of my commands are being executed there.

I was able to get sd2df to work this morning, so that is a good sign. It took about 3-5 min. I will look into whether we can install saspy 3.1.9 on our systems.

I am working on generating logs for the df2sd that worked. Unfortunately, I cannot just post the whole log because of company/patient privacy concerns.

So that I have a baseline: at what dataset size/complexity do you see performance hits? We are going to have datasets that are hundreds of MB to a few GB in size, with 100+ columns. Do these qualify as 'large'?

tomweber-sas commented 4 years ago

In the wrong order :)

You can email logs directly to me, instead of posting them and having to take the time to obfuscate them. If you would be allowed to send something to our Tech Support, for instance, it can come to me. It's all confidential.

The message

NOTE: Data file MYLIB.NNNXXX.DATA is in a format that is native to another host, or the file encoding does not match the session 
      encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce 
      performance.

means that the SAS data set was created by a SAS session on a different system or architecture than the SAS session accessing it. So the engine has to convert its contents to this system's architecture on input. The things that can differ are the endianness (byte swapped or not) and the offsets and lengths of numeric types. Character encoding is a different thing, but that's also something to understand. So it's conceivable that if the pandas code reading that dataset doesn't take all of that into account, then some of what's in that dataframe may be corrupt. That's why I wanted to use saspy (SAS) to read this data set and create a dataframe that way, so we can see if that works round trip, and compare it against the data frame you're getting from the pandas method. I don't know anything about the data itself, so I'm still guessing at reasons this would be failing. It still seems like it's data specific, because if it can do 91 rows, it can do 92 rows, unless something about what's in some row is causing a problem.

Since sd2df did read in and create the DF, can you then df2sd that dataframe back? And was there anything interesting in the log from that? I'm guessing it looked like the one you sent that was killed.

I don't have specific performance numbers, as it depends more on network speed if you're not all on the same machine. Those sound like they are big enough to justify using the DISK method, or possibly CSV. Both of those store the data in a file and use pandas to read that file to create the dataframe. The default (MEMORY) method of sd2df caches all of the data in memory and builds the dataframe with pandas methods that aren't as fast as the read-from-file methods. So you should see the performance difference just by trying each way with a given file.
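A quick way to compare the three on a given table (a sketch; 'NNNNXXX' and 'mylib' are the placeholders from earlier, it assumes sd2df_DISK takes the same table/libref arguments as the others, and _DISK needs 3.1.9 as noted above):

import time

for method in ('sd2df', 'sd2df_CSV', 'sd2df_DISK'):
    start = time.time()
    df = getattr(sas, method)('NNNNXXX', 'mylib')   # same table via the MEMORY, CSV, and DISK transfers
    print(method, df.shape, round(time.time() - start, 1), 'seconds')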

biojerm commented 4 years ago

Great, I will email you the logs from a few different scenarios! It might not be for a few hours, since I want to give the processes a chance to run. I'll focus on reading in the data using saspy, to prevent any possible data corruption.

tomweber-sas commented 4 years ago

Ok, cool. And for using the DISK method, do grab 3.1.9. There is a performance improvement for DISK in there over the first implementation, which actually is in 3.1.7. So you may as well use the better version of it!

Thanks, Tom

tomweber-sas commented 4 years ago

As a follow up: I got logs from cases that worked and ones that didn't (saspy-only round trip), and they recreated the dataset without any formats, and that worked round trip with saspy. Thanks to @biojerm for getting on a web-ex where we were able to diagnose the hang that was happening. It turns out to be a socket deadlock: SAS was writing a lot to the log about invalid data due to missing values for datetime columns. SAS ended up hanging because stderr was full, since saspy was still writing data to stdin (it doesn't read stderr until it's done), and saspy then blocked too, since SAS was blocked and wasn't reading stdin. So the fix is to add a read of stderr in the loop where saspy is writing data, so we don't deadlock like this (in case SAS writes to the log during this, which isn't the usual case, but was this time). That fixed it, but I need to test and validate more. More to come ... Tom
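The general shape of that kind of fix, illustrated outside of saspy (a minimal sketch, not the actual saspy code): drain the child's stderr while streaming rows to its stdin, so a chatty log can't fill the stderr pipe and block both sides.

import os
import select
import subprocess

def stream_rows(proc: subprocess.Popen, rows, encoding='latin_1'):
    # Write rows to the child's stdin, draining stderr as we go so the child can keep
    # logging without filling its stderr pipe and blocking (the deadlock described above).
    log = []
    for row in rows:
        proc.stdin.write((row + '\n').encode(encoding))
        while select.select([proc.stderr], [], [], 0)[0]:   # anything waiting on stderr?
            chunk = os.read(proc.stderr.fileno(), 4096)
            if not chunk:
                break
            log.append(chunk)
    proc.stdin.flush()
    return b''.join(log)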

tomweber-sas commented 4 years ago

@biojerm I've pushed this fix, and one for the format errors, and one for the sd2df hang when fmterr is in effect, to master. Got all 3 handled! Do you think you can grab sasiostdio.py (or just install master) and see if all three validate on your side? No rush, when you have a chance. All look good here.

Thanks! Tom

biojerm commented 4 years ago

@tomweber-sas Would you please double check that you pushed to master? This is the latest hash I am seeing, from late Dec: cc93eed5b22aec1472430b5e93650ae030f09248

tomweber-sas commented 4 years ago

I don't know what you're talking about, it's there :) LOL Yeah, totally did everything but push it. OK, it's there now, so ... Thanks man! Tom

biojerm commented 4 years ago

The update works on my dataset. Thanks!

ditordccaa commented 3 years ago

Hi @tomweber-sas, I'm reopening this issue since the transfer between a SAS dataset and a pandas DataFrame is slow on my end. I have a Python installation and the saspy package on a Windows machine. SAS, however, is in a Unix environment and it's a SAS Grid platform.

Here is the information about my SAS connection using the saspy package:

>>> sas
Access Method         = IOM
SAS Config name       = winiomlinux
SAS Config file       = P:/..../saspy/sascfg_personal.py
WORK Path             = /saswork/SAS_work49BE000057E2_my_hostname.organization.com
SAS Version           = 9.04.01M7P08052020
SASPy Version         = 3.6.4
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = latin1
Python Encoding value = latin1
SAS process Pid value = 22498

As for the code that I submit to read a SAS dataset into a pandas DataFrame: the dataset has 4M rows and 160 columns and is about 3.5 GB. Sorry in advance if there are any typos, since I'm working on a machine with no internet and I'm not allowed to copy/paste.

import saspy
import pandas as pd
import time

sas = saspy.SASsession(cfgname="winiomlinux")
sas.saslib("lib", path="my_specific_path")
hr = sas.sasdata(table="my_dataset", libref="lib")

start = time.time()
df = hr.to_df_DISK(tempfile="local_path_on_the_windows_machine/my_dataset.csv",
                   tempkeep=True)
print(time.time() - start)

It's worth mentioning that when I convert a small dataset (such as CARS.sas7bdat, or a dataset with 1000 observations), the to_df method works just fine and is quick. However, with this dataset of almost 3.5 GB it's very, very slow (it takes almost 1 h).

Do you have any leads on the matter that I can try?

Cheers.

tomweber-sas commented 3 years ago

Hey @ditordccaa, I can help with this. Can you open a new issue with this info? This closed issue isn't the same problem; it's a completely different thing. Thanks! Tom