sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy
Other
367 stars 149 forks source link

SASPy sas7bdat to pandas df using sd2df #541

Closed kwangccc closed 1 year ago

kwangccc commented 1 year ago

Hi I am trying to run the following code to convert a sas7bdat dataset to pandas df using sd2df through a remote connection of Jupyter notebook; however, I am running into timeouts in processing.

Code: sas = saspy.SASsession(cfgname='default')

rc = sas.submit(""" libname dat '/sasdata/modeling/_datasamples/transunion/data' cvpbytes=100;

data tu_archive_20230411; set dat.p2_tudata_renamed; run; """)

sas.HTML(rc['LST']) print(rc['LOG']) tu_archive = sas.sd2df(table = 'tu_archive_20230411', libref = 'work', method = 'DISK')

tomweber-sas commented 1 year ago

Well, I can't see anything shat shows me what happened. Can you show me the output from running these steps? Also submit your SASsession object so I can get some information from that (just submit sas in a cell after you get the connection).

Actually, how about running the following and showing all of the output. If you can just run this in a python shell, it's all be easier to cut-n-paste, but however you can run it. It's the same as what you're doing, just simpler and with some more information.

cell1: sas = saspy.SASsession(cfgname='default') sas

cell2: sas.submitLOG(""" libname dat '/sasdata/modeling/_datasamples/transunion/data' cvpbytes=100;

data tu_archive_20230411; set dat.p2_tudata_renamed; run; """)

cell3: sd = sas.sasdata('tu_archive_20230411') sd.contents(results='text')

cell4: tu_archive = sd.to_df()

Thanks, Tom

kwangccc commented 1 year ago

Hi Tom,

I have the following outputs for the code:

cell1: SAS Connection established. Subprocess id is 1328572

Access Method = STDIO SAS Config name = default SAS Config file = /usr/local/anaconda3/lib/python3.9/site-packages/saspy/sascfg.py WORK Path = SAS Version = 9.04.01M7P08062020 SASPy Version = 4.7.0 Teach me SAS = False Batch = False Results = Pandas SAS Session Encoding = utf-8 Python Encoding value = utf_8 SAS process Pid value = 1328572

cell2:

21 ods listing close;ods html5 (id=saspy_internal) file=stdout options(bitmap_mode='inline') device=svg style=HTMLBlue; ods 21 ! graphics on / outputfmt=png; NOTE: Writing HTML5(SASPY_INTERNAL) Body file: STDOUT 22
23
24 libname dat '/sasdata/modeling/_datasamples/transunion/data' cvpbytes=100; NOTE: Libref DAT was successfully assigned as follows: Engine: CVP Physical Name: /sasdata/modeling/_datasamples/transunion/data 25
26 data tu_archive_20230411; NOTE: Data file DAT.P2_TUDATA_RENAMED.DATA is in a format that is native to another host, or the file encoding does not match the session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce performance. 27 set dat.p2_tudata_renamed; 28 run; NOTE: There were 143371 observations read from the data set DAT.P2_TUDATA_RENAMED. NOTE: The data set WORK.TU_ARCHIVE_20230411 has 143371 observations and 2063 variables. NOTE: DATA statement used (Total process time): real time 29.84 seconds cpu time 28.77 seconds

29
30
31 ods html5 (id=saspy_internal) close;ods listing;

cell3:

                                                       The SAS System                  Wednesday, April 12, 2023 05:08:00 PM   1

                                                   The CONTENTS Procedure

         Data Set Name        WORK.TU_ARCHIVE_20230411                                 Observations          143371
         Member Type          DATA                                                     Variables             2063  
         Engine               V9                                                       Indexes               0     
         Created              04/12/2023 17:08:38                                      Observation Length    21360 
         Last Modified        04/12/2023 17:08:38                                      Deleted Observations  0     
         Protection                                                                    Compressed            NO    
         Data Set Type                                                                 Sorted                NO    
         Label                                                                                                     
         Data Representation  SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64                                
         Encoding             utf-8  Unicode (UTF-8)                                                               

                                             Engine/Host Dependent Information

 Data Set Page Size          131072                                                                                        
 Number of Data Set Pages    23898                                                                                         
 First Data Page             3                                                                                             
 Max Obs per Page            6                                                                                             
 Obs in First Data Page      1                                                                                             
 Number of Data Set Repairs  0                                                                                             
 Filename                   tu_archive_20230411.sas7bdat
 Release Created             9.0401M7                                                                                      
 Host Created                Linux                                                                                         
 Inode Number                536871966                                                                                     
 Access Permission           rw-r--r--                                                                                     
 Owner Name                                                                                     
 File Size                   3GB                                                                                           
 File Size (bytes)           3132489728                                                                                    

The list of contents is massive so I will skip over that. The code is still running for cell4. I am curious what the processing time is for a dataframe of this size: 143371 rows x 2063 columns.

kwangccc commented 1 year ago

The code runs and the content shows, however, it is the last step (converting to pandas) that takes a long time/never finish running until the server times out

tomweber-sas commented 1 year ago

Great thanks, that helps a lot. So, what do you mean the server times out? This is a STDIO connection, so saspy is the one starting a local SAS session and directly interacting with it. So there's no 'server timeout' to be had. Since it's a 3G dataset, it will take some time to transfer over. I can't tell you exactly so lets just try something else to see if there actually some other problem or not.

First, since you're just local on linux, can you run `top -d1' from another shell and see if both python and SAS are eating CPU while this is running?

The other thing is lets try this and see what we see:

sas.submitLOG('options fullstimer;') sd = sas.sasdata('tu_archive_20230411', dsopts={'obs' : 10}) # then try 1000, 10000 ... print(sas.lastlog())

And lets see what that shows. If it works fine, we can up the obs count and get a sense of how long it's taking and extrapolate for the whole 3G. If that fails or something, then we have something else to look into.

Tom

kwangccc commented 1 year ago

Hi Tom,

sorry for the confusion, I am connecting to a remote server that runs python/jupyter notebook, I understand that it is a large file so I wanted to gauge what the expected time is to convert. sd.to_df() has been running for roughly 25-30 minutes and I was wondering if that is expected or longer than usual

kwangccc commented 1 year ago

A work around I have is to read into SAS through SAS studio, saving it as a csv and reading it in python through csv, which is a lot faster, I just wanted to make sure there is nothing wrong with saspy, its configuration or my code in general

kwangccc commented 1 year ago

In your experience, how well does saspy handle large dataframe transitions as such?

tomweber-sas commented 1 year ago

I'm confused about you saying your connecting to a remote server? What you showed was that python is on linux and so is the SAS session (on the same linux box). I'm guessing jupyter is running there too. That's what you're describing as remote? Your browser showing the notebook is what's on your loca machine and everything else is on the remote linux machine?

Can you run the last set of code I provides so we can see if there is a problem or not?

I'm running what I believe is the same as what you have, with a 4G file right now. I'll post what I get on my machine for that when it's done. Can you run the code above and post the output?

Thanks, Tom

tomweber-sas commented 1 year ago

ok, mine came back in just under 4 minutes. The shape and content of the data isn't the same as yours, so there's still going to be variance, but it still didn't take 20 minutes. So we need to see what is actually going on, starting with seeing if it even works with a few observations or not.

tom64-7> python3
Python 3.9.12 (main, Apr  5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import saspy
>>> sas = saspy.SASsession(cfgname='tdi'); sas
SAS Connection established. Subprocess id is 1915565

Access Method         = STDIO
SAS Config name       = tdi
SAS Config file       = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path             = /usr/tmp/SAS_work6F5F001D3AAD_tom64-7/
SAS Version           = 9.04.01M7P08052020
SASPy Version         = 5.0.2
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = utf-8
Python Encoding value = utf_8
SAS process Pid value = 1915565

>>> sas.sascfg.SAScfg.tdi
{'saspath': '/tdi/mva-v940m7/usrlibsas/laxno/SASFoundation/9.4/bin/sas_u8', 'options': ['-encoding', 'utf8']}
>>> sas.saslib('x', path='/opt/tom')

21
22   libname x    '/opt/tom'  ;
NOTE: Libref X was successfully assigned as follows:
      Engine:        V9
      Physical Name: /opt/tom
23
>>> sd = sas.sasdata('saspy4g','x')
>>> sd
Libref  = x
Table   = saspy4g
Dsopts  = {}
Results = Pandas

>>> sd.contents(results='text')

                                                           The SAS System                  Wednesday, April 12, 2023 01:56:00 PM   2

                                                       The CONTENTS Procedure

            Data Set Name        X.SASPY4G                                                Observations          5000000
            Member Type          DATA                                                     Variables             115
            Engine               V9                                                       Indexes               0
            Created              11/02/2017 08:37:28                                      Observation Length    920
            Last Modified        11/02/2017 08:37:28                                      Deleted Observations  0
            Protection                                                                    Compressed            NO
            Data Set Type                                                                 Sorted                NO
            Label
            Data Representation  SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
            Encoding             latin1  Western (ISO)

                                                 Engine/Host Dependent Information

                                       Data Set Page Size          73728
                                       Number of Data Set Pages    62501
                                       First Data Page             1
                                       Max Obs per Page            80
                                       Obs in First Data Page      63
                                       Number of Data Set Repairs  0
                                       Filename                    /opt/tom/saspy4g.sas7bdat
                                       Release Created             9.0401M4
                                       Host Created                X64_10PRO
                                       Inode Number                412422804
                                       Access Permission           rwxr-xr-x
                                       Owner Name                  sastpw
                                       File Size                   4GB
                                       File Size (bytes)           4608147456

>>> import time
>>> time.time()
1681322242.7797878
>>>
>>>
>>> start=time.time(); df = sd.to_df(); print(time.time()-start)
231.27154350280762
>>> 231.27154350280762 / 60
3.8545257250467935
>>> df.shape
(5000000, 115)
>>>
kwangccc commented 1 year ago

Running your last code snippet now, I mean I connect using SSH through terminal, get a link for Jupyter to be ran on a browser through a remote server

kwangccc commented 1 year ago

Running the following:

sas.submitLOG('options fullstimer;') sd = sas.sasdata(table = 'tu_archive_20230411',libref = 'WORK', dsopts={'obs' : 10000000}) # then try 1000, 10000 ... print(sas.lastlog())

152 ods listing close;ods html5 (id=saspy_internal) file=stdout options(bitmap_mode='inline') device=svg style=HTMLBlue; ods 152! graphics on / outputfmt=png; NOTE: Writing HTML5(SASPY_INTERNAL) Body file: STDOUT 153
154 options fullstimer; 155
156 ods html5 (id=saspy_internal) close;ods listing;

158 ;';";*/; 159 %let engine=BAD; 160 proc sql; 160! select distinct engine into :engine from sashelp.VLIBNAM where libname = 'WORK';;%put engstart=&engine engend=; engstart=V9 engend= 161 quit; NOTE: The PROCEDURE SQL printed page 2. NOTE: PROCEDURE SQL used (Total process time): real time 0.00 seconds user cpu time 0.00 seconds system cpu time 0.00 seconds memory 5196.50k OS Memory 18600.00k Timestamp 04/12/2023 06:30:39 PM Step Count 17 Switch Count 0 Page Faults 0 Page Reclaims 24 Page Swaps 0 Voluntary Context Switches 2 Involuntary Context Switches 0 Block Input Operations 0 Block Output Operations 8

162
163 ;';";/;%put %upcase(e3969440a681a2408885998500000024); E3969440A681A2408885998500000024 164 ;';";/; 165 data null; e = exist("WORK.'tu_archive_20230411'n"); 166 v = exist("WORK.'tu_archive_20230411'n", 'VIEW'); 167 if e or v then e = 1; 168 put 'TABLE_EXISTS=' e 'TAB_EXTEND=';run; TABLE_EXISTS=1 TAB_EXTEND= NOTE: DATA statement used (Total process time): real time 0.00 seconds user cpu time 0.00 seconds system cpu time 0.00 seconds memory 347.84k OS Memory 13476.00k Timestamp 04/12/2023 06:30:39 PM Step Count 18 Switch Count 0 Page Faults 0 Page Reclaims 13 Page Swaps 0 Voluntary Context Switches 0 Involuntary Context Switches 0 Block Input Operations 0 Block Output Operations 0

169
170 ;';";*/;%put %upcase(e3969440a681a2408885998500000025); E3969440A681A2408885998500000025

tomweber-sas commented 1 year ago

Sorry, I forgot the sd2df statement; that didn't do anything. run this:

sas.submitLOG('options fullstimer;') sd = sas.sasdata('tu_archive_20230411', dsopts={'obs' : 10}) df = sd.to_df() print(sas.lastlog())

kwangccc commented 1 year ago

Okay, now running:

sas.submitLOG('options fullstimer;') sd = sas.sasdata(table = 'tu_archive_20230411',libref = 'WORK', dsopts={'obs' : 10}) # then try 1000, 10000 ... df = sd.to_df() print(sas.lastlog())

10 observations and still running

tomweber-sas commented 1 year ago

ok, then there's something wrong it would seem. So, can you connect to the machine where this is actually running? I'm still not sure of what your configuration is. Can you cancel the submission and then try to print the log to see if there's anything there that helps? when you cancel it may ask you about canceling submitted statements (SAS prompting for this, not jupyter). If so follow the prompts and hopefully we may see something in the log.

cancel the submission then submit

print(sas.saslog())

and lets see if there's anything in it from the to dataframe method.

tomweber-sas commented 1 year ago

Can you show any of this data on here? If not you can email me the output. Can you run this and send me the output? BTW, I have a meeting in a couple minutes, so I'll get back after.

sd = sas.sasdata('tu_archive_20230411', results='text') sd.contents() # I need to see all the output including the variables sd.head()

Thanks, Tom

the output should be all of the contents and 5 rows of data (if I didn't make a typo in the code)

kwangccc commented 1 year ago

Tom,

How would I cancel the submission? I can stop the process in Jupyter? It would be better if I can email you the output rather than posting it on here.

tomweber-sas commented 1 year ago

in jupyter you click the square button tab to effectively cause a Ctl-C.

kwangccc commented 1 year ago

KeyboardInterrupt Traceback (most recent call last) Cell In[11], line 3 1 sas.submitLOG('options fullstimer;') 2 sd = sas.sasdata(table = 'tu_archive_20230411',libref = 'WORK', dsopts={'obs' : 10}) # then try 1000, 10000 ... ----> 3 df = sd.to_df() 4 print(sas.lastlog())

File /usr/local/anaconda3/lib/python3.9/site-packages/saspy/sasdata.py:1213, in SASdata.to_df(self, method, kwargs) 1211 if self.sas.sascfg.pandas: 1212 raise type(self.sas.sascfg.pandas)(self.sas.sascfg.pandas.msg) -> 1213 df = self.sas.sasdata2dataframe(self.table, self.libref, self.dsopts, method, kwargs) 1214 self.sas._lastlog = self.sas._io._log[lastlog:] 1215 return df

File /usr/local/anaconda3/lib/python3.9/site-packages/saspy/sasbase.py:1832, in SASsession.sasdata2dataframe(self, table, libref, dsopts, method, kwargs) 1830 df = None 1831 else: -> 1832 df = self._io.sasdata2dataframe(table, libref, dsopts, method=method, kwargs) 1834 self._lastlog = self._io._log[lastlog:] 1835 return df

File /usr/local/anaconda3/lib/python3.9/site-packages/saspy/sasiostdio.py:2156, in SASsessionSTDIO.sasdata2dataframe(self, table, libref, dsopts, rowsep, colsep, rowrep, colrep, port, wait, kwargs) 2153 return self.sasdata2dataframeCSV(table, libref, dsopts, port=port, wait=wait, kwargs) 2154 #elif method and method.lower() == 'disk': 2155 else: -> 2156 return self.sasdata2dataframeDISK(table, libref, dsopts, rowsep, colsep, 2157 rowrep, colrep, port=port, wait=wait, **kwargs)

File /usr/local/anaconda3/lib/python3.9/site-packages/saspy/sasiostdio.py:2557, in SASsessionSTDIO.sasdata2dataframeDISK(self, table, libref, dsopts, rowsep, colsep, rowrep, colrep, port, wait, **kwargs) 2554 quoting = kwargs.pop('quoting', 3) 2556 sock.listen(1) -> 2557 self._asubmit(code, 'text') 2559 if wait > 0 and sel.select([sock],[],[],wait)[0] == []: 2560 logger.error("error occured in SAS during sasdata2dataframe. Trying to return the saslog instead of a data frame.")

File /usr/local/anaconda3/lib/python3.9/site-packages/saspy/sasiostdio.py:753, in SASsessionSTDIO._asubmit(self, code, results) 750 if (ods): 751 pgm += odsclose --> 753 out = self.stdin.write(pgm) 755 self.stdin.flush() 757 return str(out)

KeyboardInterrupt:

tomweber-sas commented 1 year ago

Yeah, interrupting that didn't provide any useful info. Were you able to run that last set of code? I haven't gotten anything in email. just checking :)

kwangccc commented 1 year ago

Hi Tom, what is your email? I will send over the results of the last set of code.

tomweber-sas commented 1 year ago

Hey, it's on my account here, but it's just tom.weber@sas.com. Cool, thanks!

tomweber-sas commented 1 year ago

just checking, as I haven't gotten anything by email. This was the code I was interested it, so I can see what the data is like.

sd = sas.sasdata('tu_archive_20230411', results='text') sd.contents() # I need to see all the output including the variables sd.head()

Also, I assume other data works using sd2df(), and it's something specific to this data set, right? The following runs, right? df = sas.sd2df('cars','sashelp') df

Thanks, Tom

kwangccc commented 1 year ago

Hi Tom, I just sent the email over for the sd contents and head and df = sas.sd2df('cars','sashelp') works fine.

tomweber-sas commented 1 year ago

Thanks, I've received it and am looking. Can you confirm that other datasets transfer over via sd2df()? That the problem is specific to this data set?

tomweber-sas commented 1 year ago

oh, never mind, you just said it did in the last post.

tomweber-sas commented 1 year ago

Hey, that helped a lot. I was able to reproduce the problem! That's the good news. The bad news is this will take some time to fix. It's not a simple one. But, I have a workaround for you to try, which I think will get you working at least! can you try the following and see if it gets you working?

sas.submitLOG('options nosource;')
tu_archive = sas.sd2df(table = 'tu_archive_20230411', libref = 'work', method = 'DISK')
sas.submitLOG('options source;')

Thanks, Tom

kwangccc commented 1 year ago

Hi Tom,

The code above works! What is and how would I go about fixing the issue? Is it a configuration problem?

Thank you!

Best,

Kevin

tomweber-sas commented 1 year ago

Oh, it's not a problem of yours, Nothing for you to do. It's an issue in my code where the communication with SAS is being blocked because the code being submitted is so big; the STDIN is being blocked, so my python code can't continue, since SAS is writing all that code to the LOG which is blocking SAS on writing to STDERR, since I'm not reading it off there as I'm blocked trying to write to stdin. So it's a deadlock situation I have to address. That's why simply turning off 'source' is allowing it to work. But, that's not the real answer; just a workaround for you at the moment to get you back up and running.

I'll work on it next week and update you on my progress. Have a great weekend!

Thanks, Tom

kwangccc commented 1 year ago

Hi Tom,

Thank you for your help! Have a great weekend as well!

Best, Kevin

tomweber-sas commented 1 year ago

Hey @kwangccc , I've pushed a fix for this problem. It's at main right now. I was wondering if you're able to test it out with your use case? I like to have customers prove my fix works for them too, before building a new release with changes. But, it seems like you're using a deployment that you may not be able to update yourself? Let me know if you can try this out or not. I was able to reproduce it and I'm sure this is the same problem that you were having, so I'm not too concerned about it, if you can't try it until you get a version with this change deployed where you're running it.

Thanks, Tom

kwangccc commented 1 year ago

Hi Tom,

Yes I am unable to update myself. How would one go about updating? I can forward instructions to our IT team.

Thanks,

Kevin

tomweber-sas commented 1 year ago

Sweet! Yeah, the instructions are here: https://sassoftware.github.io/saspy/install.html

The simple thing to do is just this:

pip uninstall -y saspy
pip install git+https://git@github.com/sassoftware/saspy.git@main
tomweber-sas commented 1 year ago

Hey, I've tested this out enough on my end to prove it's solid. I went ahead and built a new release with this; V5.1.2. So, if they haven't updated the deployment from main yet, just have them update to this current production release.

pip uninstall -y saspy pip install saspy

I'll still hold off closing this till I hear it worked for you though. Thanks! Tom

tomweber-sas commented 1 year ago

Hey, I'm going to close this out since the fix is in the current production release. When you get a chance to try it out, if there's anything not working right, just reopen and let me know. I think you should be good though. If you need anything else, just open another issue!

Thanks, Tom