moshekaplan closed this issue 6 years ago
Unless I'm misreading your question, I believe you'd simply want to convert the SAS data set into a (pandas) DataFrame and write to csv that way.
Eg:
frame = SASdata.to_df() # data is now local to Python
frame.to_csv(...) # with your path, options
I attempted to run the following commands:
session=saspy.SASsession(..) # IOM connection
data = session.sasdata(table_name, lib_name)
frame = data.to_df()
When it tries to execute frame = data.to_df(), it raises an error:
AssertionError: 37 columns passed, passed data had 40 columns
@moshekaplan what version of saspy do you have? You can submit your SASsession object and that will provide the info. And, I'd like to see what the data looks like. Can you submit the following:
session
data.contents()
Thanks! Tom
(Pdb) session
Access Method = IOM
SAS Config name = MY_CONFIG
WORK Path = S:\SANITIZED_PATH\Prc2\
SAS Version = 9.04.01M2P07232014
SASPy Version = 2.2.4
Teach me SAS = False
Batch = False
Results = Pandas
SAS Session Encoding = WLATIN1
Python Encoding value = latin1
(Pdb) data.contents()
{'Sortedby': Member Label1 \
0 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Sortedby
1 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Validated
2 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Character Set
cValue1 nValue1
0 Event_Date Event_Time Event_Seq NaN
1 YES NaN
2 ANSI NaN , 'Attributes': Member Label1 \
0 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Data Set Name
1 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Member Type
2 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Engine
3 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Created
4 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Last Modified
5 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Protection
6 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Data Set Type
7 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Label
8 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Data Representation
9 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Encoding
cValue1 nValue1 Label2 \
0 SASUSER.MOSHE_TEMP_TABLE_TEST_2 NaN Observations
1 DATA NaN Variables
2 V9 NaN Indexes
3 04/17/2018 15:41:57 1.839599e+09 Observation Length
4 04/17/2018 15:41:57 1.839599e+09 Deleted Observations
5 NaN Compressed
6 NaN Reuse Space
7 NaN Point to Observations
8 WINDOWS_64 NaN Sorted
9 wlatin1 Western (Windows) NaN
cValue2 nValue2
0 8328 8328.0
1 37 37.0
2 0 0.0
3 3522 3522.0
4 0 0.0
5 CHAR NaN
6 NO NaN
7 YES NaN
8 YES NaN
9 0.0 , 'Enginehost': Member Label1 \
0 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Data Set Page Size
1 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Number of Data Set Pages
2 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Number of Data Set Repairs
3 SASUSER.MOSHE_TEMP_TABLE_TEST_2 ExtendObsCounter
4 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Filename
5 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Release Created
6 SASUSER.MOSHE_TEMP_TABLE_TEST_2 Host Created
cValue1 nValue1
0 16384 16384.0
1 257 257.0
2 0 0.0
3 YES NaN
4 S:\SANITIZED_PATH\moshe_temp_table_te... NaN
5 9.0401M2 NaN
6 X64_ES08R2 NaN , 'Variables': Member Num Variable Type Len \
0 SASUSER.MOSHE_TEMP_TABLE_TEST_2 19 SANITIZED_00 Num 6
1 SASUSER.MOSHE_TEMP_TABLE_TEST_2 21 SANITIZED_21 Char 75
2 SASUSER.MOSHE_TEMP_TABLE_TEST_2 22 SANITIZED_22 Char 75
3 SASUSER.MOSHE_TEMP_TABLE_TEST_2 7 SANITIZED_07 Char 100
4 SASUSER.MOSHE_TEMP_TABLE_TEST_2 1 SANITIZED_01 Char 10
5 SASUSER.MOSHE_TEMP_TABLE_TEST_2 3 SANITIZED_03 Num 8
6 SASUSER.MOSHE_TEMP_TABLE_TEST_2 2 SANITIZED_02 Char 8
7 SASUSER.MOSHE_TEMP_TABLE_TEST_2 6 SANITIZED_06 Char 39
8 SASUSER.MOSHE_TEMP_TABLE_TEST_2 9 SANITIZED_09 Char 10
9 SASUSER.MOSHE_TEMP_TABLE_TEST_2 5 SANITIZED_05 Char 39
10 SASUSER.MOSHE_TEMP_TABLE_TEST_2 10 SANITIZED_10 Char 39
11 SASUSER.MOSHE_TEMP_TABLE_TEST_2 15 SANITIZED_15 Char 100
12 SASUSER.MOSHE_TEMP_TABLE_TEST_2 37 SANITIZED_37 Char 1000
13 SASUSER.MOSHE_TEMP_TABLE_TEST_2 20 SANITIZED_20 Char 8
14 SASUSER.MOSHE_TEMP_TABLE_TEST_2 17 SANITIZED_17 Char 5
15 SASUSER.MOSHE_TEMP_TABLE_TEST_2 16 SANITIZED_16 Char 75
16 SASUSER.MOSHE_TEMP_TABLE_TEST_2 18 SANITIZED_18 Char 75
17 SASUSER.MOSHE_TEMP_TABLE_TEST_2 25 SANITIZED_25 Char 1000
18 SASUSER.MOSHE_TEMP_TABLE_TEST_2 8 SANITIZED_08 Char 50
19 SASUSER.MOSHE_TEMP_TABLE_TEST_2 24 SANITIZED_24 Char 6
20 SASUSER.MOSHE_TEMP_TABLE_TEST_2 23 SANITIZED_23 Char 500
21 SASUSER.MOSHE_TEMP_TABLE_TEST_2 4 SANITIZED_04 Char 20
22 SASUSER.MOSHE_TEMP_TABLE_TEST_2 14 SANITIZED_14 Char 20
23 SASUSER.MOSHE_TEMP_TABLE_TEST_2 12 SANITIZED_12 Num 8
24 SASUSER.MOSHE_TEMP_TABLE_TEST_2 32 SANITIZED_32 Char 39
25 SASUSER.MOSHE_TEMP_TABLE_TEST_2 33 SANITIZED_33 Num 8
26 SASUSER.MOSHE_TEMP_TABLE_TEST_2 34 SANITIZED_34 Num 4
27 SASUSER.MOSHE_TEMP_TABLE_TEST_2 35 SANITIZED_35 Num 6
28 SASUSER.MOSHE_TEMP_TABLE_TEST_2 36 SANITIZED_36 Num 6
29 SASUSER.MOSHE_TEMP_TABLE_TEST_2 27 SANITIZED_27 Char 39
30 SASUSER.MOSHE_TEMP_TABLE_TEST_2 28 SANITIZED_28 Num 8
31 SASUSER.MOSHE_TEMP_TABLE_TEST_2 29 SANITIZED_29 Num 4
32 SASUSER.MOSHE_TEMP_TABLE_TEST_2 30 SANITIZED_30 Num 6
33 SASUSER.MOSHE_TEMP_TABLE_TEST_2 31 SANITIZED_31 Num 6
34 SASUSER.MOSHE_TEMP_TABLE_TEST_2 13 SANITIZED_13 Char 25
35 SASUSER.MOSHE_TEMP_TABLE_TEST_2 11 SANITIZED_11 Char 20
36 SASUSER.MOSHE_TEMP_TABLE_TEST_2 26 SANITIZED_26 Char 75
Pos Format Label
0 40 COMMA10. SANITIZED_00
1 713 $CHAR75. SANITIZED_01
2 788 $CHAR75. SANITIZED_02
3 186 $CHAR100. SANITIZED_03
4 70 $CHAR10. SANITIZED_04
5 0 Z15. SANITIZED_05
6 80 $CHAR8. SANITIZED_06
7 147 $CHAR39. SANITIZED_07
8 336 $CHAR10. SANITIZED_08
9 108 $CHAR39. SANITIZED_09
10 346 $CHAR39. SANITIZED_10
11 450 $CHAR100. SANITIZED_11
12 2522 $CHAR1000. SANITIZED_12
13 705 $CHAR8. SANITIZED_13
14 625 $CHAR5. SANITIZED_14
15 550 $CHAR75. SANITIZED_15
16 630 $CHAR75. SANITIZED_16
17 1369 $CHAR1000. SANITIZED_17
18 286 $CHAR50. SANITIZED_18
19 1363 $CHAR6. SANITIZED_19
20 863 $CHAR500. SANITIZED_20
21 88 $CHAR20. SANITIZED_21
22 430 $CHAR20. SANITIZED_22
23 8 DATETIME24. SANITIZED_23
24 2483 $CHAR39. SANITIZED_24
25 24 HEX12. SANITIZED_25
26 36 HEX4. SANITIZED_26
27 58 HEX8. SANITIZED_27
28 64 HEX8. SANITIZED_28
29 2444 $CHAR39. SANITIZED_29
30 16 HEX12. SANITIZED_30
31 32 HEX4. SANITIZED_31
32 46 HEX8. SANITIZED_32
33 52 HEX8. SANITIZED_33
34 405 $CHAR25. SANITIZED_34
35 385 $CHAR20. SANITIZED_35
36 2369 $CHAR75. SANITIZED_36 }
Thanks! I wonder if this could be due to the encodings not matching up. It looks like SAS is running with Windows Latin 1, which is a different code page than Latin 1. In your 'bear' configuration definition, can you change (or add in) the encoding key and set it to use Windows Latin 1:
'encoding' : 'cp1252',
Let's see if that fixes it. I don't recognize the error that you were getting, so it's possible it's a transcoding issue that's corrupting the data and then having an odd downstream effect causing that error.
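For reference, the encoding key sits alongside the other connection keys in the named config dict in sascfg_personal.py. A minimal sketch; the host and port values here are placeholders, not values from this thread:

```python
# sascfg_personal.py -- sketch of an IOM config with the encoding key.
# 'bear' matches the config name mentioned above; iomhost/iomport are
# placeholders.
SAS_config_names = ['bear']

bear = {
    'java':     'java',
    'iomhost':  'sas.example.com',  # placeholder
    'iomport':  8591,               # placeholder
    'encoding': 'cp1252',           # Python codec matching WLATIN1
}
```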
Thanks! Tom
Same issue manifests when using 'encoding' : 'cp1252'
Hmm, do you mind trying this, just to see if it's specific to this data or something more pervasive?
cars = session.sasdata('cars', 'sashelp')
df = cars.to_df()
df.head()
cars.head()
Thanks, Tom
Seems to run without issue:
(Pdb) cars = session.sasdata('cars', 'sashelp')
(Pdb) df = cars.to_df()
(Pdb) df.head()
Make Model Type Origin DriveTrain MSRP Invoice EngineSize \
0 Acura MDX SUV Asia All 36945 33337 3.5
1 Acura RSX Type S 2dr Sedan Asia Front 23820 21761 2.0
2 Acura TSX 4dr Sedan Asia Front 26990 24647 2.4
3 Acura TL 4dr Sedan Asia Front 33195 30299 3.2
4 Acura 3.5 RL 4dr Sedan Asia Front 43755 39014 3.5
Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 6.0 265 17 23 4451 106 189
1 4.0 200 24 31 2778 101 172
2 4.0 200 22 29 3230 105 183
3 6.0 270 20 28 3575 108 186
4 6.0 225 18 24 3880 115 197
(Pdb) cars.head()
Make Model Type Origin DriveTrain MSRP Invoice EngineSize \
0 Acura MDX SUV Asia All 36945 33337 3.5
1 Acura RSX Type S 2dr Sedan Asia Front 23820 21761 2.0
2 Acura TSX 4dr Sedan Asia Front 26990 24647 2.4
3 Acura TL 4dr Sedan Asia Front 33195 30299 3.2
4 Acura 3.5 RL 4dr Sedan Asia Front 43755 39014 3.5
Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 6 265 17 23 4451 106 189
1 4 200 24 31 2778 101 172
2 4 200 22 29 3230 105 183
3 6 270 20 28 3575 108 186
4 6 225 18 24 3880 115 197
Ok, so that's good. It would appear to be specific to that data. Looks like it took you some time to cleanse the output of the proc contents, so I expect you're not wanting to have this data here for all to see. If it's ok for me to see a little more about this data and what the problem might be, we can continue over direct email. I can also do a WebEx, if that's ok with you, to try to see what is happening.
I'm curious about how this data was created. To see if you have any problems accessing it w/in SAS, try:
data = session.sasdata(table_name, lib_name, results='HTML')
data.head()
The HTML output removes the pandas creation step, letting you see (you don't have to paste the output here) whether the data looks right or whether there are any issues with SAS processing it.
Also, 2 other things to check. If you submit
print(session.saslog())
We can see the log to see if there are any errors. Submit that after the to_df() fails.
The other thing worth trying is the data.to_df_CSV() method, to see if anything different happens.
Tom
Seems to attempt to print data without issue (although not actually printable):
(Pdb) data = session.sasdata(table_name, lib_name, results='HTML')
(Pdb) data.head()
<IPython.core.display.HTML object>
I also tried data.to_df_CSV(), but that too failed.
I'll see about emailing you the saslog() directly. I'll likely have to get management approval for that.
That isn't right; it looks like you're in the debugger -> the (Pdb) part. Is this running in a notebook, or are you running this in a shell? Sorry if so; HTML won't render in that mode. If you're in line mode, you can just assign it to a variable, write that variable to a file, and then open it in a browser.
x = data.head()
fd = open('filesystempath\data.html', 'w')  # text mode, since x is a string
fd.write(x)
fd.close()
Then you should be able to click it and it should open in your browser (Windows), or open it with whatever viewer in Linux.
data.head() returns None, so it's not possible to write the HTML to disk.
Ah, it's because of https://github.com/sassoftware/saspy/blob/d28a017212ac23eeae8e40005cad858044de28af/saspy/sasbase.py#L1130
I set it to batch and was able to retrieve the HTML content.
If I can make a recommendation: Functions should always return the value, and batch mode should only control whether or not the output is printed.
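The pattern being recommended can be sketched like this; the names here (render, head) are hypothetical stand-ins, not saspy code:

```python
def render(data):
    # stand-in for the real HTML rendering step
    return '<table>%s</table>' % data

def head(data, batch=False):
    html = render(data)
    if not batch:
        print(html)  # interactive mode: also display it
    return html      # batch or not, the caller always gets the value

x = head('rows', batch=True)  # nothing printed; value still returned
print(x)
```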
Good job, yes, batch is what that's for! So, how are you running this? What is pdb? I know that as the Python debugger. Are you running in line mode?
My plan is to use saspy within a larger script - not interactively. I'm using pdb (python debugger) because I wanted to execute some of the existing code and then enter a debugging session where I could try various commands.
What is "line mode"?
That's all good. In fact, that's what 'batch' mode is for, so you can run as a script and still get graphs and plots and such as HTML and write them to files, just like you did. By line mode I just mean running Python in a shell, from a command line, as opposed to in a Jupyter notebook. Just wanted to be sure I understood the environment. Did you see any errors in the log after running the to_df() method?
By line mode I just mean running python in a shell, from a command line, as opposed to in a jupyter notebook.
Yes, I'm running in line mode.
Did you see any errors in the log after running the to_df() method?
I did not see any errors in the log, but to_df() still did not work. The log showed VARNUMS= 37, and there were 37 variables.
Ok, kinda back to the beginning. I assume that assertion error is coming from the DataFrame constructor after bringing the data over. There are 37 variables, and everyone seems to agree on that, but seemingly I'm passing 40 variable records to the DataFrame constructor (assuming that's what this is), so I suspect something is off parsing the data I'm streaming over. I generate a data step to stream the data over to Python. I specify column and row separators so I can split up the variables on the Python side. The defaults for these are hex 01 and hex 02. Those are pretty safe for most SAS data, as SAS only has doubles and character data, and the data comes over as character; the doubles are formatted out as text, so there shouldn't be any binary 01/02 in there. So, I wonder if your data has any binary data in its character columns, such that I'm ending up splitting it into extra columns of data, which is why it says there are 40, not 37.
W/out having a better idea of the data, I can't really tell. But these values are options you can change:
def sasdata2dataframe(self, table: str, libref: str ='', dsopts: dict ={}, rowsep: str = '\x01', colsep: str = '\x02', **kwargs) -> '<Pandas Data Frame object>':
colsep and rowsep. They can only be one byte. If you can tell whether you have some char variables with binary 0x01 or 0x02 in them, you could try changing these to something that doesn't occur in your data.
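To see why a stray separator byte inflates the column count, here's a pure-Python sketch of the splitting described above. split_stream is a hypothetical helper, a much-simplified stand-in for saspy's actual parser in sasioiom.py:

```python
# Default separators saspy uses when streaming data over
ROWSEP, COLSEP = '\x01', '\x02'

def split_stream(stream, rowsep=ROWSEP, colsep=COLSEP):
    """Split a streamed buffer into rows, then each row into columns."""
    return [row.split(colsep) for row in stream.split(rowsep) if row]

# clean data: every row splits into exactly 3 columns
clean = 'a\x02b\x02c\x01d\x02e\x02f\x01'
print([len(r) for r in split_stream(clean)])  # [3, 3]

# a binary 0x02 byte embedded inside one value adds a phantom column
dirty = 'a\x02b\x02c\x01d\x02e\x02bin\x02ary\x01'
print([len(r) for r in split_stream(dirty)])  # [3, 4]
```

This is the shape of the failure reported above: 37 real variables, but the stream splits into 40 columns.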
I will dig in to this deeper tomorrow. Sorry for the trouble, but we'll get this fixed and working for you.
Thanks, Tom
You got it. That was the exact issue: one of the fields had binary data in it. Here's how I came up with a valid separator:
data = open('TEXT_OUTPUT_MOSHE_1.txt', 'rb').read()
used = set(data)  # byte values (ints) present in the streamed data
candidates = set(range(256)) - used
# keep only candidate bytes that are valid in cp1252
safe = []
for i in candidates:
    try:
        bytes([i]).decode('cp1252')
        safe.append(i)
    except UnicodeDecodeError:
        pass
print(sorted(chr(i) for i in safe))
I determined that '[' and ']' were both absent from my data and safe to use as separators. But it would be better to design and use a technique that doesn't rely on 'magic' bytes.
Ok, good deal. Well, the _CSV method doesn't rely on this, as it exports the data set to a csv file and then streams it over; no parsing needed (not by me, anyway). So, I'm curious why that failed. It can't really be the (exact) same problem, though it could still have something to do with the data. I'm curious as to what exactly the failure is with sd2df_CSV. If you're up for it, there's a branch in the repo right now called sd2df_CSV which has a change to allow you to specify a local file to write the csv data to for importing into the dataframe. W/out that, I just use a temporary file which is cleaned up. With the ability to specify the file yourself, it doesn't get cleaned up. You could try that, which would keep the CSV file there on your client so you can see what it looks like. And I'd be curious what the error or failure actually is for that case. Thanks! Tom
I'm game to switch branches. Would it be possible to give the exact sequence of commands?
For example:
session=saspy.SASsession(..) # IOM connection
data = session.sasdata(table_name, lib_name)
...
Yes, there's a tempfile= now on the _CSV methods. I'm in the process of adding tempkeep= also, so you can control whether the file gets cleaned up after it's used. So you should be able to just do the following:
session=saspy.SASsession(..) # IOM connection
data = session.sasdata(table_name, lib_name)
df = data.to_df_CSV(tempfile='os path to the file you want to use', tempkeep=True)
That should allow you to persist the CSV file on your client side. Also, this code should show the proc export in the log too, so if there's an issue there, we'll be able to see it: print(session.saslog())
Thanks! Tom
Strange. It appears that the job ran without issue, but I didn't actually get any data:
Code:
session=saspy.SASsession(..) # IOM connection
data = session.sasdata(table_name, lib_name)
df = data.to_df_CSV(tempfile='another_attempt.csv', tempkeep=True)
Output:
*** pandas.errors.EmptyDataError: No columns to parse from file
Log contents:
NOTE: The file _TOMODS1 is:
Filename=S:\SANITIZED\Prc2\tomods1,
RECFM=V,LRECL=131068,File Size (bytes)=0,
Last Modified=19Apr2018:17:21:51,
Create Time=19Apr2018:17:21:51
NOTE: 8329 records were written to the file _TOMODS1.
The minimum record length was 396.
The maximum record length was 978.
NOTE: View WORK.SASDATA2DATAFRAME.VIEW used (Total process time):
real time 0.51 seconds
cpu time 0.51 seconds
NOTE: There were 8328 observations read from the data set MY_LIB.MY_TABLE.
NOTE: There were 8328 observations read from the data set WORK.SASDATA2DATAFRAME.
NOTE: DATA statement used (Total process time):
real time 0.53 seconds
cpu time 0.53 seconds
8328 records created in _TOMODS1 from SASDATA2DATAFRAME.
Hmm, so what's the contents of 'another_attempt.csv'?
Just a \n. No other contents.
I'm going to have to mock up some data w/ 0x01, 0x02 and see what happens with that. It seems like it's failing to transfer across IOM, given what you show: SAS exported it out, but nothing ended up on the client. Was there a traceback, or just the pandas empty data error? I'll try to track this down on my side. Is there anything else unusual about your data, other than having binary data in char columns, so I can try to simulate what you're seeing?
Thanks, Tom
@moshekaplan I've been able to reproduce some of this. I am getting the failure you observed trying to create a dataframe when transferring binary data containing 0x01 and 0x02. I've also seen it work when I change my row/col separator characters. I don't see a problem, however, when using the _CSV method. It transfers the data and creates the dataframe. I've also used tempfile/tempkeep to see the contents.
I've merged the sd2df_CSV branch into master, and it's possible there were some changes made in there after you switched to trying it. I don't know of anything specific, but can you switch to master (maybe a fresh pull?) and see if you are seeing different results than I am.
I'll attach an HTML file (you'll have to remove the extra .txt off it; can't attach .html) and a screenshot of the csv file. If you are still getting no data in your local csv, even though the remote export worked, then I'll have to guess there's another data-specific thing that I don't have, beyond just the 01/02 row/col separator issue. issue128_binary.htm.txt
Thanks! Tom
Was there a traceback, or just the pandas empty data error?
Just the pandas empty data error.
I'll try to track this down on my side. Anything else unusual about your data other than having binary data in char columns? So I can try to simulate what you're seeing.
Not that I'm aware of.
I'll test again with the new master. Thanks again for all the time and effort you're putting into resolving this issue.
OK, so a little further digging: It definitely seems to be an issue caused by one of the data values. If I create a temporary table with a subset of the columns, I can retrieve the data without issue.
I'm going to see if I can narrow down the exact table and its values, so I can create and share a minimal test case.
OK, I'm not sure if a single-column table is handled differently, but when I used the code below, it didn't print the error message even once, indicating that there were no exceptions (although some CSVs were blank):
columns = "COLUMN_1, COLUMN_12, ..., COLUMN_37".split(", ")
for c in columns:
    job = "proc sql; CREATE TABLE MYLIB.%s_TABLE_%s AS SELECT %s from MYLIB.%s ; quit;" % (table_name, c, c, table_name)
    print(job)
    sql_results = session.submit(job, results="Pandas")
    open('%s.log' % c, 'wb').write(sql_results['LOG'].encode('utf-8'))
    open('%s.txt' % c, 'wb').write(sql_results['LST'].encode('utf-8'))
    data = session.sasdata("%s_TABLE_%s" % (table_name, c), "MYTABLE", results='Text')
    try:
        data.to_df_CSV(tempfile="mycsv_%s_%s.csv" % (table_name, c), tempkeep=True)
    except:
        print("Column %s failed!" % c)
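The narrowing-down approach above generalizes to any per-column probe. A toy sketch, with a hypothetical fetch standing in for the saspy round trip:

```python
def find_bad_columns(columns, fetch):
    """Probe each column independently; collect the ones that fail."""
    bad = []
    for c in columns:
        try:
            fetch(c)
        except Exception:
            bad.append(c)
    return bad

def fetch(col):
    # toy stand-in for the saspy round trip; fails on one column
    if col == 'COLUMN_12':
        raise ValueError('binary data in column')

print(find_bad_columns(['COLUMN_1', 'COLUMN_12', 'COLUMN_37'], fetch))
# ['COLUMN_12']
```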
As an aside, my entire business need was to select only a fraction of the columns, and so I've now accomplished that using the code below:
fname = "mycsv.csv"
job = "proc sql; CREATE TABLE MYLIB.%s_2 AS SELECT ... from MYLIB.%s_1;" % (table_name, table_name)
session.submit(job, results="Text")
data = session.sasdata( "%s_2" % table_name, "MYLIB", results='Text')
data.to_df_CSV(tempfile=fname, tempkeep=True)
However, if you'd like to debug this further, I'm happy to help.
Well, I don't like to not understand a problem. It does appear to be specific to your data set. The constraints we're running under here are as follows:
1) The file on the SAS side that the csv data is written to is utf-8. So it's expected that SAS transcodes the data from your SAS session encoding (WLATIN1) to UTF-8 when writing it out.
2) The java IOM client tells IOM that file is binary, so IOM doesn't try to do any transcoding.
3) Reading that binary stream into java, it's read into, effectively, a byte array and then sent to saspy. So no transcoding there either.
4) saspy expects utf-8 data, which it should be getting.
What I believe you've seen is that the csv file on the SAS side is written out (the log shows the proc export), but you've gotten an empty csv file on the saspy side (using tempfile/tempkeep=). There's no traceback or any kind of failure? Is this all correct?
I've been looking into this more, regarding the 4 things listed above. I'm seeing that 2 and 4 aren't consistent depending upon whether it's local or remote (particularly with _CSV). For local, the proc export is written to disk and not streamed over IOM, as it is for remote. The csv file isn't the same regarding the encoding. In both cases the filename stmt states encoding=utf8, but I'm not getting a utf8 byte stream like I'm expecting when reading over IOM, though I am when writing straight to disk. I think it might be that SAS reads the file back in to give to IOM to send over. In that case, SAS would be transcoding it back from utf8 on disk to session encoding (wlatin1), which IOM then streams across as binary. That would account for the difference I'm seeing.
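The transcoding mismatch described here is easy to demonstrate locally: bytes written as UTF-8 but re-read under a cp1252-like session encoding no longer round-trip.

```python
# Demonstrates the utf8 -> cp1252 mismatch: a non-ASCII character
# encoded as UTF-8 turns into mojibake when decoded as cp1252.
s = 'café'
utf8_bytes = s.encode('utf-8')      # b'caf\xc3\xa9'
garbled = utf8_bytes.decode('cp1252')
print(garbled)                      # cafÃ©
```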
I've made some changes in the access method to account for this. This still wouldn't account for what you're seeing (if what I stated above about what you're seeing is right).
Since everything is currently up to date as master, you could try this again from master and see if you still see the same thing.
Thanks, Tom
Using commit 00b0880f0b90967d49895653aab946e6b5b6a98f :
sas = saspy.SASsession(cfgname=mycfg, omruser=config['user'], omrpw=config['pass'])
if sas is None:
sys.exit(1)
data = sas.sasdata(tablename, libname)
df = data.to_df_CSV(tempfile='another_attempt.csv', tempkeep=True)
Generates the following output:
Traceback (most recent call last):
File "test_saspy", line 334, in <module>
main()
File "test_saspy", line 308, in main
run_saspy_test()
File "test_saspy", line 183, in run_saspy_test
df = data.to_df_CSV(tempfile='another_attempt.csv', tempkeep=True)
File "saspy_github/saspy/sasbase.py", line 1911, in to_df_CSV
return self.to_df(method='CSV', tempfile=tempfile, tempkeep=tempkeep, **kwargs)
File "saspy_github/saspy/sasbase.py", line 1899, in to_df
return self.sas.sasdata2dataframe(self.table, self.libref, self.dsopts, method, **kwargs)
File "saspy_github/saspy/sasbase.py", line 755, in sasdata2dataframe
return self._io.sasdata2dataframe(table, libref, dsopts, method=method, **kwargs)
File "saspy_github/saspy/sasioiom.py", line 1178, in sasdata2dataframe
return self.sasdata2dataframeCSV(table, libref, dsopts, **kwargs)
File "saspy_github/saspy/sasioiom.py", line 1550, in sasdata2dataframeCSV
df = pd.read_csv(tmpcsv, index_col=False, engine='c', dtype=dts, **kwargs)
File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 709, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 449, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 818, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 1049, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/lib/python3.5/site-packages/pandas-0.22.0-py3.5-linux-x86_64.egg/pandas/io/parsers.py", line 1695, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 565, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
another_attempt.csv is empty, aside from a newline character.
I also confirmed that the table contains data.
and the local .csv file is empty? And there's nothing in the saslog showing a problem?
The local CSV is empty. Here's the output from the saslog():
114 ;*';*";*/;
115 data sasdata2dataframe / view=sasdata2dataframe; set SANITIZED_LIB.SANITIZED_TABLE;
116 format 'SANITIZED best32. ;
117 run;
NOTE: DATA STEP view saved on file WORK.SASDATA2DATAFRAME.
NOTE: A stored DATA STEP view cannot run under a different operating system.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
118
119
120 ;*';*";*/;
121 %put E3969440A681A2408885998500000011;
E3969440A681A2408885998500000011
122
123 proc export data=sasdata2dataframe outfile=_tomods1 dbms=csv replace;
123 ! run
124 ;
NOTE: Unable to open parameter catalog: SANITIZED_LIB.PARMS.PARMS.SLIST in update mode. Temporary parameter values will be saved to
WORK.PARMS.PARMS.SLIST.
NOTE: View WORK.SASDATA2DATAFRAME.VIEW used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
NOTE: View WORK.SASDATA2DATAFRAME.VIEW used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
NOTE: There were 18 observations read from the data set SANITIZED_LIB.SANITIZED_TABLE.
NOTE: View WORK.SASDATA2DATAFRAME.VIEW used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
NOTE: There were 18 observations read from the data set SANITIZED_LIB.SANITIZED_TABLE.
125 /**********************************************************************
126 * PRODUCT: SAS
127 * VERSION: 9.4
128 * CREATOR: External File Interface
129 * DATE: 26APR18
130 * DESC: Generated SAS Datastep Code
131 * TEMPLATE SOURCE: (None Specified.)
132 ***********************************************************************/
133 data _null_;
134 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
135 %let _EFIREC_ = 0; /* clear export record count macro variable */
136 file _TOMODS1 delimiter=',' DSD DROPOVER ;
137 if _n_ = 1 then /* write column names or labels */
138 do;
139 put
140 "SANITIZED_VAR_01"...
213 ;
214 end;
215 set SASDATA2DATAFRAME end=EFIEOD;
216 format SANITIZED_VAR_01 $char10. ;
217 ...
252 format SANITIZED_VAR_37 $char1000. ;
253 do;
254 EFIOUT + 1;
255 put SANITIZED_VAR_01 $ @;
...
291 put SANITIZED_VAR_37 $ ;
292 ;
293 end;
294 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
295 if EFIEOD then call symputx('_EFIREC_',EFIOUT);
296 run;
NOTE: The file _TOMODS1 is:
Filename=S:\SANITIZED_PATH\Prc2\tomods1,
RECFM=V,LRECL=131068,File Size (bytes)=0,
Last Modified=26Apr2018:19:55:23,
Create Time=26Apr2018:19:55:18
NOTE: 8329 records were written to the file _TOMODS1.
The minimum record length was 396.
The maximum record length was 978.
NOTE: View WORK.SASDATA2DATAFRAME.VIEW used (Total process time):
real time 0.49 seconds
cpu time 0.51 seconds
NOTE: There were 8328 observations read from the data set SANITIZED_LIB.SANITIZED_TABLE.
NOTE: There were 8328 observations read from the data set WORK.SASDATA2DATAFRAME.
NOTE: DATA statement used (Total process time):
real time 0.53 seconds
cpu time 0.54 seconds
8328 records created in _TOMODS1 from SASDATA2DATAFRAME.
NOTE: "_TOMODS1" file was successfully created.
NOTE: PROCEDURE EXPORT used (Total process time):
real time 0.73 seconds
cpu time 0.76 seconds
297
12 The SAS System 19:55 Thursday, April 26, 2018
298
299 %put E3969440A681A2408885998500000008;
E3969440A681A2408885998500000008
300
That looks just like my log, too, when I run this. I just can't account for what you're getting; I can't really see a path that would do this. I'm gonna have to let this stir in the back of my head a bit.
Well, I can think of one reason this might be happening. If there's an issue transcoding the csv file on the SAS server side, when IOM is trying to read it to send it to me, I can imagine that we might not see anything in the log, and that I might not get an error along the way; I'd just not get the data. So, I have 2 things we can try. The first is to just put a print statement in the code where I'm reading this and writing it to the local csv file, just to see what's coming across. We would expect nothing, but let's see. If you have the current code from master (the line numbers will match; otherwise you can still see where to put it), you can just add print(data) to sasioiom.py at line 1515:
try:
    data = self.stdout[0].recv(4096).decode(self.sascfg.encoding, errors='replace')
except (BlockingIOError):
    data = b''
print(data)  # line 1515, let's see what we are or are not getting
if len(data) > 0:
    datar += data
    data = datar.rpartition('\n')
    datap = data[0]+data[1]
    datar = data[2]
The other thing, to see if my speculation might be right, is to do the proc export and then try to read the file back in explicitly, so we'll see if there is any kind of error in the saslog. Run these in separate cells if in a notebook.
filename = sas.workpath+"test.csv"
ll = sas.submit("filename x '"+filename+"' encoding='utf-8';proc export data=sashelp.cars outfile=x dbms=csv replace;run;")
print(ll['LOG'])
ll = sas.submit("data _null_; infile x; input @; put _infile_;run;")
print(ll['LOG'])
That's all I've got currently. And, just double-checking: are you running the current master, or an earlier version?
Thanks, Tom
Should be the newest version of master:
$ git log -1
commit 00b0880f0b90967d49895653aab946e6b5b6a98f (HEAD -> master, origin/master, origin/HEAD)
Author: Tom Weber <Tom.Weber@sas.com>
Date: Thu Apr 26 13:30:21 2018 -0400
set appname again on reconnect
I made changes similar to what you described:
try:
    data = self.stdout[0].recv(4096).decode(self.sascfg.encoding, errors='replace')
except (BlockingIOError) as e:
    data = b''
    print(e)
print(data)  # line 1515, let's see what we are or are not getting
My output was the following repeated many times:
[Errno 11] Resource temporarily unavailable
b''
I then attempted to debug it manually:
> saspy/sasioiom.py(1514)sasdata2dataframeCSV()
-> data = self.stdout[0].recv(4096).decode(self.sascfg.encoding, errors='replace')
(Pdb) import select
(Pdb) data = self.stdout[0].recv(4096)
(Pdb) len(data)
33
(Pdb) p data
b'\nE3969440A681A2408885998500000008'
(Pdb) c
[Errno 11] Resource temporarily unavailable
b''
> saspy/sasioiom.py(1513)sasdata2dataframeCSV()
-> pdb.set_trace()
(Pdb) data = self.stdout[0].recv(4096)
*** BlockingIOError: [Errno 11] Resource temporarily unavailable
(Pdb) select.select([self.stdout[0]], [] ,[self.stdout[0]])
Written out:
I broke into the Python debugger, right before the call to
```python3
data = self.stdout[0].recv(4096).decode(self.sascfg.encoding, errors='replace')
```
As you can see above, the first time it calls self.stdout[0].recv(), it returns 33 bytes of data: b'\nE3969440A681A2408885998500000008'.
The second call raises errno 11 (EAGAIN or EWOULDBLOCK), as described in recv's manpage:
value -1 is returned and the external variable errno is set to EAGAIN or EWOULDBLOCK. The receive calls normally return any data available, up to the requested amount, rather than waiting for receipt of the full amount requested.
To test whether it simply wasn't waiting long enough for data to come back, I called select.select() to wait until data was available. The call to select.select() waited without receiving data for more than ten minutes; I then killed it. So it seems that, aside from those first 33 bytes, there is no output on self.stdout[0].
You've verified that we are getting nothing back from the SAS server for that CSV file. The b'\nE3969440A681A2408885998500000008', which was the only thing you got, is the 'termination' string I send to saspy from java to identify that the transmission is complete. You are getting none of the CSV file sent back over.
So, it may very well be something like I'm speculating. Can you run the proc export and the data step and see if there are any errors in the saslog after the data step? This is progress :)
thanks! Tom
With the exception of `NOTE: Unable to open parameter catalog: SASUSER.PARMS.PARMS.SLIST in update mode. Temporary parameter values will be saved to WORK.PARMS.PARMS.SLIST.`, it seemed to have no issues and printed out the listing of cars:
```
5 The SAS System 17:46 Friday, April 27, 2018
31 ods listing close;ods html5 (id=saspy_internal) file=_tomods1 options(bitmap_mode='inline') device=svg style=HTMLBlue;
31 ! ods graphics on / outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: _TOMODS1
32
33 filename x 'S:\SANITIZED\test.csv' encoding='utf-8';proc export data=sashelp.cars outfile=x
33 ! dbms=csv replace;run;
NOTE: Unable to open parameter catalog: SASUSER.PARMS.PARMS.SLIST in update mode. Temporary parameter values will be saved to
WORK.PARMS.PARMS.SLIST.
34 /**********************************************************************
35 * PRODUCT: SAS
36 * VERSION: 9.4
37 * CREATOR: External File Interface
38 * DATE: 27APR18
39 * DESC: Generated SAS Datastep Code
40 * TEMPLATE SOURCE: (None Specified.)
41 ***********************************************************************/
42 data _null_;
43 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
44 %let _EFIREC_ = 0; /* clear export record count macro variable */
45 file X delimiter=',' DSD DROPOVER ;
46 if _n_ = 1 then /* write column names or labels */
47 do;
48 put
49 "Make"
50 ','
51 "Model"
52 ','
53 "Type"
54 ','
55 "Origin"
56 ','
57 "DriveTrain"
58 ','
59 "MSRP"
60 ','
61 "Invoice"
62 ','
63 "EngineSize"
64 ','
65 "Cylinders"
66 ','
67 "Horsepower"
68 ','
69 "MPG_City"
70 ','
71 "MPG_Highway"
72 ','
73 "Weight"
74 ','
75 "Wheelbase"
76 ','
77 "Length"
78 ;
79 end;
80 set SASHELP.CARS end=EFIEOD;
81 format Make $13. ;
82 format Model $40. ;
83 format Type $8. ;
84 format Origin $6. ;
85 format DriveTrain $5. ;
86 format MSRP dollar8. ;
87 format Invoice dollar8. ;
88 format EngineSize best12. ;
89 format Cylinders best12. ;
90 format Horsepower best12. ;
91 format MPG_City best12. ;
92 format MPG_Highway best12. ;
93 format Weight best12. ;
94 format Wheelbase best12. ;
95 format Length best12. ;
96 do;
97 EFIOUT + 1;
98 put Make $ @;
99 put Model $ @;
100 put Type $ @;
101 put Origin $ @;
102 put DriveTrain $ @;
103 put MSRP @;
104 put Invoice @;
105 put EngineSize @;
106 put Cylinders @;
107 put Horsepower @;
108 put MPG_City @;
109 put MPG_Highway @;
110 put Weight @;
111 put Wheelbase @;
112 put Length ;
113 ;
114 end;
115 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
116 if EFIEOD then call symputx('_EFIREC_',EFIOUT);
117 run;
NOTE: The file X is:
Filename=S:\SANITIZED_PATH\test.csv,
RECFM=V,LRECL=131068,File Size (bytes)=0,
Last Modified=27Apr2018:17:46:51,
Create Time=27Apr2018:17:46:51
NOTE: 429 records were written to the file X.
The minimum record length was 68.
The maximum record length was 123.
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.04 seconds
428 records created in X from SASHELP.CARS.
NOTE: "X" file was successfully created.
NOTE: PROCEDURE EXPORT used (Total process time):
real time 0.26 seconds
cpu time 0.23 seconds
118
119 ods html5 (id=saspy_internal) close;ods listing;
120
********************************************************************************
6 The SAS System 17:46 Friday, April 27, 2018
123 ods listing close;ods html5 (id=saspy_internal) file=_tomods1 options(bitmap_mode='inline') device=svg style=HTMLBlue;
123 ! ods graphics on / outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: _TOMODS1
124
125 data _null_; infile x; input @; put _infile_;run;
NOTE: The infile X is:
Filename=S:\SANITIZED_PATH\test.csv,
RECFM=V,LRECL=131068,File Size (bytes)=38145,
Last Modified=27Apr2018:17:46:51,
Create Time=27Apr2018:17:46:51
Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
Acura,MDX,SUV,Asia,All,"$36,945","$33,337",3.5,6,265,17,23,4451,106,189
...
Volvo,XC70,Wagon,Europe,All,"$35,145","$33,112",2.5,5,208,20,27,3823,109,186
NOTE: 429 records were read from the infile X.
The minimum record length was 68.
The maximum record length was 123.
NOTE: DATA statement used (Total process time):
real time 0.09 seconds
cpu time 0.11 seconds
126
127 ods html5 (id=saspy_internal) close;ods listing;
128
```
I'm sorry I meant your data set that doesn't work. Not cars. I just did that for the example.
Got it. I ran the following:
```python3
filename = sas.workpath+"test2.csv"
ll = sas.submit("filename myfname '"+filename+"' encoding='utf-8';proc export data=MYLIB.MYTABLE outfile=myfname dbms=csv replace;run;")
print(ll['LOG'])
print ("*"*80)
ll = sas.submit("data _null_; infile myfname; input @; put _infile_;run;")
print(ll['LOG'])
```
The log output included the data from all rows and columns.
Additionally, I then ran the following to generate a list of uncommon bytes:
```python3
import string
data = set(open("job4_data.txt").read())
print (sorted(data - set(string.ascii_letters + string.digits)))
```
My output was the following:
```
['\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07', '\x08', '\t', '\n', '\x0c',
'\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19', ' ', '!',
'"', '#', '$', '%', '&', "'", '(', ')', '+', ',', '-', '.', '/', ':', ';', '=', '?',
'@', '_', '`', '\x80', '\x81', '\x82', '\x84', '\x86', '\x90', '\x92', '\x93',
'\x94', '\x98', '\x99', '\x9a', '\x9c', '\x9d', '\x9e', '\xa0', '\xa1', '\xa2',
'\xa6', '\xac', '\xb0', '\xbd', '\xbf', '\xc2', '\xc6', '\xcb', '\xe2', '\xef']
```
Well, since you got all the data written to the log, with no errors or issues, I don't know how the data is not making it back to the Python side. When you try the to_df_CSV() method and it gets no data in the local file, and then hits the pandas error, is your SAS session still active and functioning? The Java IOM process hasn't terminated? You can still run other things? I have a data set I created with every hex value from 0x00 to 0xff and I can transfer it with the _CSV() method without problem. I'm just not seeing how this is getting nothing over to the local file.
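As an aside, the kind of worst-case payload described here (every byte value 0x00-0xff) can be approximated on the Python side. This sketch is my own illustration of the wlatin1-to-utf-8 round trip the CSV transfer performs, not saspy code:

```python3
# Build every byte value 0x00-0xff, the worst-case single-byte payload
payload = bytes(range(256))

# wlatin1 corresponds to Windows cp1252; a few bytes (0x81, 0x8d, 0x8f,
# 0x90, 0x9d) are undefined in cp1252, so use errors="replace" for those
text = payload.decode("cp1252", errors="replace")
assert len(text) == 256

# Round-trip through utf-8, as the CSV data does on its way to pandas
utf8 = text.encode("utf-8")
assert utf8.decode("utf-8") == text
print("round-trip ok, utf-8 byte length:", len(utf8))
```

If the transfer itself were a pure transcoding problem, a payload like this would be expected to surface it.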
I ran the following code:
```python3
data = sas.sasdata(tablename, libname)
df = data.to_df_CSV(tempfile='mytempcsv.csv', tempkeep=True)
print(session.saslog())
```
As before, the CSV file was empty and it raised a `pandas.errors.EmptyDataError: No columns to parse from file`.
Could you still run other methods after that? Was your session still working?
Yes, the session was still working and I was able to retrieve other data without issue.
Is it possible that the byte/encoding issue only manifests when returning a table with multiple columns and we've both been only testing the nonprintable characters on single-column tables?
Thought I had replied to this, but I don't see it. I've tried to get this to fail for me. I've created a multi column table containing all 256 wlatin1 codepoints; semi randomly scrambled in the various columns. I still can't get this to fail. I am trying remote IOM to a workspace server running wlatin1 (cp1252). With the _CSV methods I can keep the temporary local file, and I see it's got the data in utf-8 encoding, which is then imported successfully into a dataframe. I've tried it with local IOM too, and get the same results. nls1.html.txt
Any chance you can see if the workspace server logs show any issues on the IOM server side? I can't see why there would be no errors or anything, yet the data won't show up on the client side.
Maybe you can try this example and see if it works or has issues for you?
Tom
BTW, just remove the .txt off the file name. Can't upload .html to this.
@moshekaplan I've found some edge cases where I was getting bad transcoding with both to_df and to_df_CSV. I've got fixes in a new branch called nls2. Are you able to try out this branch with your failing cases and see if they work correctly now?
Thanks! Tom
Sorry for the delayed response. I tested the newest version of master (which includes your three nls2 commits) and it appears to have solved the issue.
Thank you! Moshe
@moshekaplan That's great news, thanks for validating that for me! Yes, I was reading the stream of data in chunks and concatenating them, but I was converting each chunk to characters instead of concatenating the binary and then converting to characters afterward. So it only caused a problem when a multibyte character was split between the chunks I was getting. So it was 'data specific', although completely a logic error on my part. I am glad this is what you were hitting and that this fixes it!
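The bug described above is easy to reproduce in isolation. A minimal sketch (my own illustration, not the actual saspy code):

```python3
import codecs

# UTF-8 bytes for 'é' (0xc3 0xa9) split across two chunks, mimicking a
# multibyte character straddling a socket read boundary
chunks = [b"caf\xc3", b"\xa9 ok"]

# Buggy approach: decode each chunk independently -- the split character
# becomes two U+FFFD replacement characters
buggy = "".join(c.decode("utf-8", errors="replace") for c in chunks)

# Fixed approach: concatenate the raw bytes first, then decode once
fixed = b"".join(chunks).decode("utf-8", errors="replace")
print(fixed)   # café ok

# An incremental decoder also handles characters split across chunks,
# without buffering the whole stream
dec = codecs.getincrementaldecoder("utf-8")()
streamed = "".join(dec.decode(c) for c in chunks) + dec.decode(b"", final=True)
assert streamed == fixed
```

Because the mangling only happens when a chunk boundary lands inside a multibyte sequence, the failure is data- and timing-dependent, which matches how hard this was to reproduce.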
Thanks again for all the help looking into this! Thanks, Tom
Is there a recommended way to export a SAS data set to a CSV file on the system running Python? The only relevant method appeared to be `SASdata.to_csv`, which writes it to the SAS system's disk.