Closed biojerm closed 4 years ago
Hey, I can help with this. My initial thought is that it's not about how many rows, but rather the contents of the data. I could use a little more info to start with. Can you submit your SASsession object (sas) so I can see that output? It will look something like this:
>>> sas
Access Method = STDIO
SAS Config name = sdssas
SAS Config file = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path = /sastmp/SAS_workF33A000052DC_tom64-5/
SAS Version = 9.04.01M6D11072018
SASPy Version = 3.1.9
Teach me SAS = False
Batch = False
Results = Pandas
SAS Session Encoding = latin9
Python Encoding value = iso8859_15
SAS process Pid value = 21212
And also submit sas_df.dtypes, so I can see what types the data frame has. When you say you can send 91 rows but not more, what have you tried? Could it be something about the contents of row 92 that is causing a problem? Or can you send more than 91 rows from a different range of the data? df2sd generates code to send delimited rows of data to SAS and uses a low hex value character as the delimiter, x'03', as this isn't a value that would appear in character data. Does this data have binary data in character columns? The other data problem could be if the character columns have newlines in them. That can cause issues too, and there's an option on df2sd to address that: embedded_newlines=True. You could try that to see if that could be the problem.
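For what it's worth, such characters can be checked for locally in plain pandas before transferring. This is a sketch; `find_problem_cells` is a hypothetical helper, not part of saspy:

```python
import pandas as pd

# Hypothetical helper: scan the object (string) columns of a DataFrame for
# characters that could interfere with df2sd's delimited transfer -- the
# x'03' delimiter saspy uses, or embedded CR/LF newlines.
def find_problem_cells(df):
    suspects = []
    for col in df.select_dtypes(include="object").columns:
        for idx, val in df[col].items():
            if isinstance(val, str) and any(c in val for c in ("\x03", "\r", "\n")):
                suspects.append((idx, col))
    return suspects

demo = pd.DataFrame({"a": ["ok", "bad\nvalue"], "b": ["fine", "also fine"]})
print(find_problem_cells(demo))  # [(1, 'a')]
```

If this reports hits, embedded_newlines=True (or cleaning the offending cells) would be the thing to try.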
The next thing I would try is using SAS to read that dataset. Is there any issue with that? Can you use sd2df to read it and create the data frame? If so, does df2sd then work on that dataframe?
Thanks! Tom
Thanks for getting back! I did some more investigation
1) Had a colleague open the dataset in an interactive SAS session and there were no issues. So I don't think the .sas7bdat dataset is corrupted (so that is a good sign)
2) Here are the settings of my sas object:
In [72]: sas = saspy.SASsession(cfgfile='~/sascfg_personal.py', cfgname='mycfg')
SAS Connection established. Subprocess id is 102939
No encoding value provided. Will try to determine the correct encoding.
Setting encoding to latin_1 based upon the SAS session encoding value of latin1.
In [73]: sas
Out[73]:
Access Method = STDIO
SAS Config name = mycfg
SAS Config file = /home/jlabarge/sascfg_personal.py
WORK Path = /tmp/SAS_workFB250001921C_statsrv/
SAS Version = 9.04.01M2P07232014
SASPy Version = 3.1.7
Teach me SAS = False
Batch = False
Results = Pandas
SAS Session Encoding = latin1
Python Encoding value = latin_1
SAS process Pid value = 102940
3) Here are the datatypes of the dataframe built using pd.read_sas
'protocol', 'site', 'visit', 'spctm', 'spcode', 'filetime',
'assay_error', 'new_record', 'result_excluded', 'any_errors',
'id_statusassayresqc', 'pending', 'missing_sas_code',
'id_statusassayresqc_old', 'set_failure', 'spid_bl', 'procldmsid',
'triplet', 'community_no', 'ptid', 'alqcntoth', 'alqvoloth',
'alqlaboth', 'alqcnt1', 'alqcnt2', 'alqcnt3', 'alqcnt4', 'alqcnt5',
'alqvol1', 'alqvol2', 'alqvol3', 'alqvol4', 'alqvol5', 'alqlab1',
'alqlab2', 'alqlab3', 'alqlab4', 'alqlab5', 'visit_a', 'spcol',
'requested', 'exp_status', 'from_nt_ds', 'visit_e', 'assayexp_exclude',
'problem_code', 'resultstat_ch'],
dtype='object'),
dtype('<M8[ns]'): Index(['filedate', 'reqdate', 'requestdt', 'time_status', 'procdt', 'spcdt_a',
'spcdt_e'],
dtype='object'),
dtype('O'): Index(['conid', 'guspec', 'ASSAYTYPE', 'rluc', 's_coc', 'cp', 'pid', 'sid',
'name', 'gender', 'dobc', 'assay', 'result', 'dilution', 'code',
'module', 'caldtc', 'operator', 'releasedby', 'reagentlot', 'reagent',
'drawdtc', 'compldtc', 'location', 'file_name_orig', 'file_name_labkey',
'network', 'labid', 'specid', 'spectm', 'assayspec', 'specnm',
'specode', 'csvfile', 'subtype', 'reqdatec', 'comment', 's_comment',
'ref', 'requestby', 'requestfmt', 'excludefrom', 'file_name_merge',
'file_ext_orig', 'status_source', 'sas_version', 'file_input',
'file_name', 'file_dir', 'file_full', 'primstr', 'addstr', 'dervst2',
'cohort2', 'volstr', 'pktime_a', 'pktime_e', 'problem_code_note'],
dtype='object')}
Looks like my datatypes are: object ('O'), float64, and datetime64[ns]
4) I tried a different range of rows, excluding the previously problematic ones, and saw that 90 rows read in fine, 91 do not
In [85]: sas = saspy.SASsession(cfgfile='~/sascfg_personal.py', cfgname='mycfg')
SAS Connection established. Subprocess id is 114524
No encoding value provided. Will try to determine the correct encoding.
Setting encoding to latin_1 based upon the SAS session encoding value of latin1.
In [86]: sas_ds = sas.df2sd(sas_df[93:93+90], 'test_output2') # <--- load in worked quickly
In [87]: sas_ds = sas.df2sd(sas_df[93:93+91], 'test_output2') # <--- load failed
^C---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-87-88933f55643d> in <module>
----> 1 sas_ds = sas.df2sd(sas_df[93:93+91], 'test_output2')
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in df2sd(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1011 :return: SASdata object
1012 """
-> 1013 return self.dataframe2sasdata(df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1014
1015 def dataframe2sasdata(self, df: 'pandas.DataFrame', table: str = '_df', libref: str = '',
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in dataframe2sasdata(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1045 return None
1046 else:
-> 1047 self._io.dataframe2sasdata(df, table, libref, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1048
1049 if self.exist(table, libref):
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in dataframe2sasdata(self, df, table, libref, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1453 #self._asubmit(card, "text")
1454
-> 1455 self._asubmit(";;;;", "text")
1456 ll = self.submit("run;", 'text')
1457 return
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in _asubmit(self, code, results)
529 self.stdin.write(odsclose)
530
--> 531 self.stdin.flush()
532
533 return str(out)
KeyboardInterrupt:
In [88]:
5) Tried a smaller subset of data that encompassed the rows that were not loading previously. This worked, so I don't think there is an issue with any specific row of data.
SAS Connection established. Subprocess id is 127914
No encoding value provided. Will try to determine the correct encoding.
Setting encoding to latin_1 based upon the SAS session encoding value of latin1.
In [105]: sas_ds = sas.df2sd(sas_df[80:100], 'test_output2')
In [106]:
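[Editor's note: since transfers below some row count succeed and above it fail, a bisection over the slice size can pin down the exact cutoff in a handful of attempts. A sketch, where `transfer_ok` is a stand-in for "did sas.df2sd(sas_df[:n], ...) complete" and is simulated here:]

```python
# Generic bisection to isolate the smallest transfer size that fails,
# assuming transfer_ok(n) is True below some cutoff and False at/after it.
def first_failing_size(transfer_ok, lo, hi):
    """Smallest n in (lo, hi] for which transfer_ok(n) is False."""
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if transfer_ok(mid):
            lo = mid
        else:
            hi = mid
    return hi

# Simulated: transfers of 91 rows or fewer succeed, 92+ hang/fail.
ok = lambda n: n <= 91
print(first_failing_size(ok, 0, 1000))  # 92
```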
6) Tried to use the embedded_newlines option.
SAS Connection established. Subprocess id is 117641
No encoding value provided. Will try to determine the correct encoding.
Setting encoding to latin_1 based upon the SAS session encoding value of latin1.
In [91]: sas_ds = sas.df2sd(sas_df[93:93+91], 'test_output2', embedded_newlines=True)
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-91-06c4ff4de519> in <module>
----> 1 sas_ds = sas.df2sd(sas_df[93:93+91], 'test_output2', embedded_newlines=True)
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in df2sd(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1011 :return: SASdata object
1012 """
-> 1013 return self.dataframe2sasdata(df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1014
1015 def dataframe2sasdata(self, df: 'pandas.DataFrame', table: str = '_df', libref: str = '',
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in dataframe2sasdata(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1045 return None
1046 else:
-> 1047 self._io.dataframe2sasdata(df, table, libref, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1048
1049 if self.exist(table, libref):
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in dataframe2sasdata(self, df, table, libref, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1453 #self._asubmit(card, "text")
1454
-> 1455 self._asubmit(";;;;", "text")
1456 ll = self.submit("run;", 'text')
1457 return
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in _asubmit(self, code, results)
529 self.stdin.write(odsclose)
530
--> 531 self.stdin.flush()
532
533 return str(out)
7) Tried to use the saspy sd2df method, and this did not create a df. I only gave it 5-10 min though, so if I need to wait longer please let me know:
In [102]: all_sas_df = sas.sd2df('NNNNXXX.sas7bdat')
^C---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-102-fbd6dbb693d4> in <module>
----> 1 all_sas_df = sas.sd2df('NNNNXXX.sas7bdat')
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in sd2df(self, table, libref, dsopts, method, **kwargs)
1094 """
1095 dsopts = dsopts if dsopts is not None else {}
-> 1096 return self.sasdata2dataframe(table, libref, dsopts, method, **kwargs)
1097
1098 def sd2df_CSV(self, table: str, libref: str = '', dsopts: dict = None, tempfile: str = None,
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in sasdata2dataframe(self, table, libref, dsopts, method, **kwargs)
1236
1237 dsopts = dsopts if dsopts is not None else {}
-> 1238 if self.exist(table, libref) == 0:
1239 print('The SAS Data Set ' + libref + '.' + table + ' does not exist')
1240 return None
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in exist(self, table, libref)
704 :rtype: bool
705 """
--> 706 return self._io.exist(table, libref)
707
708 def sasets(self) -> 'SASets':
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in exist(self, table, libref)
915 code += "put 'TABLE_EXISTS=' e 'TAB_EXTEND=';run;"
916
--> 917 ll = self.submit(code, "text")
918
919 exists = int(ll['LOG'].rpartition("TABLE_EXISTS=")[2].rpartition(" TAB_EXTEND=")[0])
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in submit(self, code, results, prompt)
632
633 out = self.stdin.write(b'\n'+logcodei.encode(self.sascfg.encoding)+b'\n')
--> 634 self.stdin.flush()
635
636 while not done:
KeyboardInterrupt:
8) I also have access to SAS_UTF8 in house, so I updated my config and tried using that with no success.
In [110]: sas
Out[110]:
Access Method = STDIO
SAS Config name = mycfg
SAS Config file = /home/jlabarge/sascfg_personal.py
WORK Path = /tmp/SAS_work31FC00003A38_statsrv/
SAS Version = 9.04.01M2P07232014
SASPy Version = 3.1.7
Teach me SAS = False
Batch = False
Results = Pandas
SAS Session Encoding = utf-8
Python Encoding value = utf_8
SAS process Pid value = 14904
In [111]: sas_df.shape
Out[111]: (133361, 118)
In [112]: sas_ds = sas.df2sd(sas_df[1:100], 'test_output2')
^C---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-112-b804163a763f> in <module>
----> 1 sas_ds = sas.df2sd(sas_df[1:100], 'test_output2')
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in df2sd(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1011 :return: SASdata object
1012 """
-> 1013 return self.dataframe2sasdata(df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1014
1015 def dataframe2sasdata(self, df: 'pandas.DataFrame', table: str = '_df', libref: str = '',
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasbase.py in dataframe2sasdata(self, df, table, libref, results, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1045 return None
1046 else:
-> 1047 self._io.dataframe2sasdata(df, table, libref, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1048
1049 if self.exist(table, libref):
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in dataframe2sasdata(self, df, table, libref, keep_outer_quotes, embedded_newlines, LF, CR, colsep)
1423 code += "format "+format+";\n"
1424 code += "infile datalines delimiter="+delim+" DSD STOPOVER;\n input "+input+";\n"+xlate+";\n datalines4;"
-> 1425 self._asubmit(code, "text")
1426
1427 for row in df.itertuples(index=False):
/usr/local/apps/python/anaconda3-2018-12-controlled/lib/python3.7/site-packages/saspy/sasiostdio.py in _asubmit(self, code, results)
524 self.stdin.write(odsopen)
525
--> 526 out = self.stdin.write(code.encode(self.sascfg.encoding)+b'\n')
527
528 if (ods):
KeyboardInterrupt:
In [113]:
Thanks again, for helping. Please let me know if you have any ideas or other troubleshooting steps I can try.
Best, Jeremy
Good job trying different things out!
For the sd2df case, I should have provided a code example for that. It doesn't work doing all_sas_df = sas.sd2df('NNNNXXX.sas7bdat')
You would need to assign the library where that data set is, in the SAS session, and then use sd2df with the table name and the libref.
sas.saslib('mylib', path='/wherever/that/dataset/lives')
# lets see what SAS thinks this is
sd = sas.sasdata('NNNNXXX', 'mylib', results='html')
sd.contents()
# try sd2df
df = sas.sd2df('NNNNXXXX', 'mylib') # or, as we have the SASdata object, just -> df = sd.to_df()
# if it ran, look at the dataframe
df.head()
df.dtypes
That would run the sd2df case.
The other thing would be: run the case that works, fewer than 92 rows, and then submit print(sas.saslog()) so that we can see the code generated for df2sd; maybe there's something there I will see.
I will dig into this more tomorrow too.
And, I'm not sure I understand the output from the dtypes on that DF the pandas method created. When I run dtypes, it looks more like this:
>>> df = sas.sd2df('cars','sashelp')
>>> df.dtypes
Make object
Model object
Type object
Origin object
DriveTrain object
MSRP int64
Invoice int64
EngineSize float64
Cylinders float64
Horsepower int64
MPG_City int64
MPG_Highway int64
Weight int64
Wheelbase int64
Length int64
dtype: object
>>>
Which shows each column's data type. I'm more curious now whether reading the dataset with SAS and moving it back and forth works (sd2df, df2sd), and whether there's something different about the dataframe the pandas method creates that is somehow not working with what I'm doing. ???
Any chance of a web-ex to do some quicker exploration? I can set that up and maybe get at this quicker if that's ok on your side.
Thanks! Tom
Sorry about the datatypes; I tried to get clever and do a group by. I have too many columns for the full output, but here is what ipython shows. Overall the datatypes are object, float64, and datetime64[ns].
compldtc object
location object
id_file float64
id_replace_file float64
...
volstr object
alqcnt1 float64
alqcnt2 float64
alqcnt3 float64
alqcnt4 float64
alqcnt5 float64
alqvol1 float64
alqvol2 float64
alqvol3 float64
alqvol4 float64
alqvol5 float64
alqlab1 float64
alqlab2 float64
alqlab3 float64
alqlab4 float64
alqlab5 float64
visit_a float64
spcdt_a datetime64[ns]
pktime_a object
spcol float64
requested float64
exp_status float64
from_nt_ds float64
visit_e float64
spcdt_e datetime64[ns]
pktime_e object
assayexp_exclude float64
problem_code float64
problem_code_note object
resultstat_ch float64
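[Editor's note: the "columns grouped by dtype" view above can be reproduced with plain pandas. A sketch on toy columns, not the actual dataset:]

```python
import pandas as pd

# Map each column to its dtype, then invert the mapping -- the same idea
# as a pandas groupby on df.dtypes.
df = pd.DataFrame({
    "compldtc": ["a", "b"],                                   # object
    "alqcnt1": [1.0, 2.0],                                    # float64
    "spcdt_a": pd.to_datetime(["2020-01-01", "2020-01-02"]),  # datetime64[ns]
})
by_dtype = {}
for col, dt in df.dtypes.items():
    by_dtype.setdefault(str(dt), []).append(col)
print(by_dtype)
# {'compldtc' under 'object', 'alqcnt1' under 'float64', 'spcdt_a' under 'datetime64[ns]'}
```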
I will try the sd2df again and post the results
I was able to read the dataset into SAS using your suggestion above. I also added an option to ignore format errors. However, when I ran the sd.to_df() method it hung. Here is a snippet of the log with the ERROR(s):
NOTE: 119 records were written to the file STDERR.
The minimum record length was 1.
The maximum record length was 9.
NOTE: There were 0 observations read from the data set MYLIB.NNNXXX.
NOTE: There were 1 observations read from the data set WORK._N_U_L_L_.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.02 seconds
382 proc delete data=work._n_u_l_l_;run;
NOTE: Deleting WORK._N_U_L_L_ (memtype=DATA).
NOTE: PROCEDURE DELETE used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
383
384 ;*';*";*/;
385 %put %upcase(e3969440a681a2408885998500000028);
E3969440A681A2408885998500000028
386 filename sock socket ':54793' lrecl=1048576 recfm=v termstr=LF;
NOTE: Data file MYLIB.NNNXXX.DATA is in a format that is native to another host, or the file encoding does not match the session
encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce
performance.
387 data _null_; set mylib.'l071hca'n ;
388 file sock dlm='02'x ; put 'conid'n 'guspec'n 'ldmsid'n best32. '02'x 'ASSAYTYPE'n 'rluc'n 's_coc'n 'recnum'n best32. '02'x
388! 'cp'n 'pid'n 'sid'n 'name'n 'gender'n 'dobc'n 'assay'n 'rlu'n best32. '02'x 's_co'n best32. '02'x 'result'n 'dilution'n 'code'n
388! 'module'n 'caldtc'n 'operator'n 'releasedby'n 'reagentlot'n 'reagent'n 'drawdtc'n 'compldtc'n 'location'n 'id_file'n best32.
388! dule'n 'caldtc'n 'operator'n 'releasedby'n 'reagentlot'n 'reagent'n 'drawdtc'n 'compldtc'n 'location'n 'id_file'n best32. '02'x
388! 'id_replace_file'n best32. '02'x 'file_name_orig'n 'file_name_labkey'n 'network'n 'protocol'n best32. '02'x 'labid'n 'specid'n
388! 'site'n best32. '02'x 'visit'n best32. '02'x 'spectm'n 'spctm'n E8601TM15.6 '02'x 'assayspec'n 'specnm'n 'specode'n 'spcode'n
388! best32. '02'x 'csvfile'n 'subtype'n 'filedate'n E8601DA10. '02'x 'filetime'n E8601TM15.6 '02'x 'reqdatec'n 'reqdate'n
388! E8601DA10. '02'x 'comment'n 'assay_error'n best32. '02'x 'new_record'n best32. '02'x 's_comment'n 'ref'n 'requestby'n
388! 'requestdt'n E8601DA10. '02'x 'requestfmt'n 'excludefrom'n 'result_excluded'n best32. '02'x 'any_errors'n best32. '02'x
388! 'file_name_merge'n 'file_ext_orig'n 'time_status'n E8601DT26.6 '02'x 'status_source'n 'id_statusassayresqc'n best32. '02'x
388! 'sas_version'n 'file_input'n 'file_name'n 'file_dir'n 'file_full'n 'pending'n best32. '02'x 'missing_sas_code'n best32. '02'x
388! 'id_statusassayresqc_old'n best32. '02'x 'set_failure'n best32. '02'x 'spid_bl'n best32. '02'x 'primstr'n 'addstr'n 'dervst2'n
388! 'procldmsid'n best32. '02'x 'procdt'n E8601DA10. '02'x 'triplet'n best32. '02'x 'community_no'n best32. '02'x 'cohort2'n
388! 'ptid'n best32. '02'x 'alqcntoth'n best32. '02'x 'alqvoloth'n best32. '02'x 'alqlaboth'n best32. '02'x 'volstr'n 'alqcnt1'n
388! best32. '02'x 'alqcnt2'n best32. '02'x 'alqcnt3'n best32. '02'x 'alqcnt4'n best32. '02'x 'alqcnt5'n best32. '02'x 'alqvol1'n
388! best32. '02'x 'alqvol2'n best32. '02'x 'alqvol3'n best32. '02'x 'alqvol4'n best32. '02'x 'alqvol5'n best32. '02'x 'alqlab1'n
388! best32. '02'x 'alqlab2'n best32. '02'x 'alqlab3'n best32. '02'x 'alqlab4'n best32. '02'x 'alqlab5'n best32. '02'x 'visit_a'n
388! best32. '02'x 'spcdt_a'n E8601DA10. '02'x 'pktime_a'n 'spcol'n best32. '02'x 'requested'n best32. '02'x 'exp_status'n best32.
388! '02'x 'from_nt_ds'n best32. '02'x 'visit_e'n best32. '02'x 'spcdt_e'n E8601DA10. '02'x 'pktime_e'n 'assayexp_exclude'n best32.
388! '02'x 'problem_code'n best32. '02'x 'problem_code_note'n 'resultstat_ch'n best32. '02'x '01'x; run;
NOTE: Format $ASSAY_LABID_NAME was not found or could not be loaded.
NOTE: Format SPNAMECRF was not found or could not be loaded.
NOTE: Format NOYESNA was not found or could not be loaded.
NOTE: Format ASSAYQC was not found or could not be loaded.
NOTE: Format RLSTAT was not found or could not be loaded.
NOTE: Format $ASSAY_LABID_NAME was not found or could not be loaded.
NOTE: The file SOCK is:
Local Host Name=comp,
Local Host IP addr=127.0.1.1,
Peer Hostname Name=comp.pc.host.org,
Peer IP addr=127.0.1.1,Peer Name=N/A,
Peer Portno=54793,Lrecl=1048576,Recfm=Variable
ERROR: Broken pipe.
FATAL: Unrecoverable I/O error detected in the execution of the DATA step program. Aborted during the EXECUTION phase.
NOTE: 104347 records were written to the file SOCK.
The minimum record length was 1980.
The maximum record length was 2184.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 104348 observations read from the data set MYLIB.NNNXXX.
NOTE: DATA statement used (Total process time):
real time 1:12.43
cpu time 3.76 seconds
389
390 ;*';*";*/;
391
392
393 ;*';*";*/;
394 %put %upcase(e3969440a681a2408885998500000029);
E3969440A681A2408885998500000029
Let me know if the full log would be useful. Also, as an FYI, I let the process run for a minute or two and then killed it with ^C.
Edits for privacy
Ah, that makes sense about the dtypes. So the sd2df looks like it was running. I'm guessing the broken pipe was because you killed it; it was 104347 records out of 133361 and 72 seconds into it. Can you let that run and see if it finishes? Also, you can use the sd2df_DISK() version to improve performance for this step. I see that this data set isn't from the same system that you're running on. That could have something to do with this. Can you run the following so we can see more about this?
sas.saslib('mylib', path='/wherever/that/dataset/lives')
# lets see what SAS thinks this is
sd = sas.sasdata('NNNNXXX', 'mylib', results='text')
sd.contents()
# try sd2df or sd2df_DISK
df = sas.sd2df('NNNNXXXX', 'mylib') # or, df = sd.to_df_DISK() which should be faster
# if it ran, look at the dataframe
df.head()
df.dtypes
Also, can we see the log from a df2sd that worked? Might be something useful in that. Oh, and I'm not sure the _DISK version is in 3.1.7; I added that recently, so to use it you might want to grab the current version, 3.1.9. The _DISK version of sd2df is faster for larger data sets than the default one, just like the _CSV version is, but handles special-case data better than CSV.
Thanks! Tom
Could you clarify your concern with the different file systems? I am SSH-ing into a Unix system that hosts our SAS license, and all of my commands are being executed there.
I was able to get the sd2df to work this morning, so that is a good sign. It took about 3-5 min. I will look into whether we can install saspy 3.1.9 on our systems.
I am working on generating logs for the df2sd that worked. Unfortunately, I cannot just post the whole log because of company/patient privacy concerns.
So I have a baseline: at what size/complexity of datasets do you see performance hits? We are going to have datasets that are hundreds of MB to a few GB in size, with 100+ columns. Do these qualify as 'large'?
In the wrong order :)
You can email logs directly to me, instead of posting them and having to take the time to obfuscate them. If you would be allowed to send something to our Tech Support, for instance, it can come to me. It's all confidential.
The message
NOTE: Data file MYLIB.NNNXXX.DATA is in a format that is native to another host, or the file encoding does not match the session
encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce
performance.
means that the SAS data set was created by a SAS session that wasn't on the same system or architecture as the SAS session accessing it. So the engine has to convert the contents of it to this system's architecture on input. The things that can differ are the endianness (byte swapped or not) and the offsets and lengths of numeric types. Character encoding is a different thing, but that's also something to understand. So it's conceivable that, if the pandas code reading that dataset doesn't take all of that into account, some of what's in that dataframe may be corrupt. That's why I wanted to use saspy (SAS) to read in this data set and create a dataframe that way, so we can see if that works round trip, and compare against the data frame you're getting from the pandas method. I don't know anything about the data itself, so I'm still guessing at reasons this would be failing. It still seems like it's data specific, because if it can do 91 rows, it can do 92 rows. Unless something about what's in some row is causing a problem.
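[Editor's note: the endianness point can be illustrated in plain Python, unrelated to SAS internals: the same bytes decode to different numbers depending on the assumed byte order, which is exactly the kind of conversion CEDA has to perform.]

```python
import struct

# Pack a double in little-endian byte order, then decode it both ways.
raw = struct.pack("<d", 3.14)            # 8 bytes, little-endian on "disk"
as_little = struct.unpack("<d", raw)[0]  # matching byte order: 3.14
as_big = struct.unpack(">d", raw)[0]     # byte-swapped read: a nonsense value
print(as_little, as_big)
```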
Since sd2df did read in and create the DF, can you then df2sd that dataframe back? And was there anything interesting in the log from that? I'm guessing it looked like the one you sent that was killed.
I don't have specific performance numbers, as it's more dependent on network speed if you're not all on the same machine. Those sound like they are big enough to justify using the DISK method, or possibly CSV. Both of those store the data in a file and use pandas to read in the file to create the dataframe. The default (MEMORY) method of sd2df caches all of the data in memory and builds the dataframe with pandas methods that aren't as fast as the read-from-file methods. So you should see the performance difference just by trying each way with a given file.
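[Editor's note: a minimal way to compare the methods is a small timing harness. In this thread's setting each candidate would be something like `lambda: sas.sd2df('mytable', 'mylib')` versus the _DISK/_CSV variants; those names require a live SAS session, so the call here is simulated.]

```python
import time

# Generic timing harness to compare transfer methods on the same table.
def time_call(label, fn):
    start = time.time()
    result = fn()
    print(f"{label}: {time.time() - start:.3f}s")
    return result

# Simulated workload standing in for a sd2df / sd2df_DISK / sd2df_CSV call.
total = time_call("simulated transfer", lambda: sum(range(1_000_000)))
print(total)  # 499999500000
```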
Great, I will email you logs from a few different scenarios! It might not be for a few hours, since I want to give the processes a chance to finish. I'll focus on reading in the data using saspy, to prevent any possible data corruption.
Ok, cool. And, for using the DISK method, do grab 3.1.9. There is a performance improvement for DISK in there over the first implementation, which is in 3.1.7. So, may as well use the better version of it!
Thanks, Tom
As a follow-up: I got logs from the cases that worked and failed (saspy-only round trip). They recreated the dataset without any formats, and that worked round trip with saspy. Thanks to @biojerm for getting on a web-ex where we were able to diagnose the hang that was happening. Turns out it's a socket deadlock situation: SAS was writing a lot to the log about invalid data due to missing values for datetime columns. SAS ended up hanging because stderr was full, while saspy was still writing data to stdin (not reading stderr until done), which then also blocked since SAS was blocked and wasn't reading stdin. So the fix is to add a read of stderr in the loop where saspy is writing data, so we don't deadlock like this (in case SAS writes to the log during this, which isn't the usual case, but was this time). That fixed it, but I need to test and validate more. More to come ... Tom
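[Editor's note: this deadlock pattern is general to piped subprocesses, and the fix can be sketched in plain Python. This is an illustration, not saspy's actual code: drain the child's stderr on a separate thread while writing to its stdin, so neither pipe buffer fills and blocks both sides.]

```python
import subprocess
import sys
import threading

# The child stands in for SAS: it copies every stdin line to stderr,
# like SAS writing log messages while data is being streamed in.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import sys\nfor line in sys.stdin: sys.stderr.write(line)"],
    stdin=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

# The fix: a concurrent reader keeps stderr drained during the writes.
chunks = []
drainer = threading.Thread(target=lambda: chunks.append(child.stderr.read()))
drainer.start()

for i in range(100_000):             # far more than an OS pipe buffer holds;
    child.stdin.write(f"row {i}\n")  # without the drainer, this write deadlocks
child.stdin.close()
child.wait()
drainer.join()
print(len(chunks[0].splitlines()))   # 100000
```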
@biojerm I've pushed this fix, and one for the format errors, and one for the sd2df hang when fmterr is in effect, to master. Got all 3 handled! Do you think you can grab sasiostdio.py (or just install master), and see if all three validate on your side? No rush, when you have a chance. All look good here.
Thanks! Tom
@tomweber-sas Would you please double check you pushed to master? This is the latest hash that I am seeing from late Dec. cc93eed5b22aec1472430b5e93650ae030f09248
I don't know what you're talking about, it's there :) LOL Yeah, totally did everything but push it. OK, it's there now, so ... Thanks man! Tom
The update works on my dataset. Thanks!
Hi @tomweber-sas, I'm reopening this issue since the transfer between a SAS dataset and a Pandas Dataframe is slow on my end. I have a Python installation and the saspy package on a Windows machine. SAS, however, is on a Unix environment and it's a SAS Grid Platform.
Here is the information about my SAS connection using the saspy package:
>>> sas
Access Method = IOM
SAS Config name = winiomlinux
SAS Config file = P:/..../saspy/sascfg_personal.py
WORK Path = /saswork/SAS_work49BE000057E2_my_hostname.organization.com
SAS Version = 9.04.01M7P08052020
SASPy Version = 3.6.4
Teach me SAS = False
Batch = False
Results = Pandas
SAS Session Encoding = latin1
Python Encoding value = latin1
SAS process Pid value = 22498
As for the code that I submit to read a SAS dataset into a Pandas Dataframe: the dataset has 4M rows and 160 columns, and it's about 3.5GB. Sorry in advance if there are any typos, since I'm working on a machine with no internet and I'm not allowed to copy/paste.
import saspy
import pandas as pd
import time
sas = saspy.SASsession(cfgname="winiomlinux")
sas.saslib("lib", path="my_specific_path")
hr = sas.sasdata(table= "my_dataset", libref="lib")
start = time.time()
df = hr.to_df_DISK(tempfile="local_path_on_the_windows_machine/my_dataset.csv",
                   tempkeep=True)
print(time.time() - start)
It's worth mentioning that when I convert a small dataset (such as CARS.sas7bdat, or a dataset with 1000 observations), the to_df method works just fine and quickly. However, with this dataset that's almost 3.5GB it's very, very slow (it takes almost an hour).
Do you have any leads on the matter that I can try?
Cheers.
Hey @ditordccaa, I can help with this. Can you open a new issue with this info, as this closed issue isn't the same; it's a completely different thing. Thanks! Tom
Hi, I am having trouble passing a large pandas DataFrame to the SAS kernel using the df2sd method. For testing purposes, I am trying to read a SAS dataset into a pandas DataFrame and then write it back out into a SAS dataset (for reference, the SAS dataset is about ~560 MB). I have a toy script that can write small pandas DataFrames to SAS datasets. However, when I tried to create a SAS dataset from a large pandas DataFrame, I noticed the df2sd step seems to hang. For some reason 91 rows seems to be the limit. When only 91 rows of data are sent to the SAS kernel, it is pretty quick. However, when more than 91 rows are sent to the kernel, the program hangs (greater than 1 hr before I got impatient). Below is the output from the ipython session I was running, along with my sascfg_personal.py and my saspy version.
Thanks in advance!