sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy

Issue writing pandas dataframe to SAS dataset using df2sd #593

Closed kwangccc closed 1 month ago

kwangccc commented 3 months ago

Hi, I am trying to convert a pandas dataframe to sas (sas7bdat) format using df2sd. The dataframe dimensions are roughly 700 columns by 20,000 rows so not considered large. I currently have a run time of roughly 7 hours and counting without completion. I am wondering what the issue is with my code below. Thanks in advance.

Code:

import saspy
sas = saspy.SASsession(cfgname='default')
sas.saslib(libref='dat', path='/sasdata/modeling/_datasamples/equifax/data')
sas.df2sd(df=ONESCORE_CFN_102023, table='ONESCORE_CFN_102023', libref='dat')

sas object output:

Access Method = STDIO
SAS Config name = default
SAS Config file = /usr/local/anaconda3/lib/python3.9/site-packages/saspy/sascfg.py
WORK Path = /saswork/work/SAS_work17840003A29F_ip-10-6-0-146.concertocard.com/
SAS Version = 9.04.01M7P08062020
SASPy Version = 4.7.0
Teach me SAS = False
Batch = False
Results = Pandas
SAS Session Encoding = utf-8
Python Encoding value = utf_8
SAS process Pid value = 238239

tomweber-sas commented 3 months ago

Can you look at the SASLOG and see any information? The paths listed for the sascfg file and WORK library in what you have above look strange. Where is this running? Can you do anything else and look at the LOG to see what's happening on that session?

kwangccc commented 3 months ago

I am running this in a Jupyter Notebook, connecting to a remote server.

Below is the screenshot of the output for this part of the session:

import saspy
sas = saspy.SASsession(cfgname='default')
sas.saslib(libref='dat', path='/sasdata/modeling/_datasamples/equifax/data')

[Image: Screenshot 2024-03-14 at 11 26 52]

I cannot see the log for the following step because it never finishes running:

sas.df2sd(df = ONESCORE_CFN_102023, table = 'ONESCORE_CFN_102023', libref='dat')

tomweber-sas commented 3 months ago

Ok, that helps! It's always better to show the output. The thing is that STDIO uses sockets to transfer data for sd2df, df2sd, upload, and download. So I suspect something strange is going on with however you're making a remote connection that saspy isn't aware of. I would expect a socket problem to produce an error instead of a hang, but I'm not sure what this environment really is. Can you elaborate on the environment: where SAS is, where Python is, and what is making the remote connection? Also, what happens when you interrupt the process (Ctrl-C or, in Jupyter, the little square button that interrupts the kernel)? What's in the saslog after that? It's bound to be a socket issue with the environment you're using.
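For the saslog question, one option after interrupting the kernel is to print only the last part of the log. This is a minimal sketch that is not part of the original thread: `tail_of_log` is a hypothetical helper, and the usage comment assumes a live session object named `sas` whose `saslog()` method returns the accumulated log as a string.

```python
def tail_of_log(log_text: str, n_lines: int = 40) -> str:
    """Return the last n_lines lines of a SAS log passed in as one string."""
    lines = log_text.splitlines()
    return "\n".join(lines[-n_lines:])

# Usage in the notebook, after interrupting the hung cell
# (assumes `sas` is the existing saspy session):
#     print(tail_of_log(sas.saslog()))
```

Printing just the tail keeps the output readable when the session has already accumulated a long log.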

kwangccc commented 3 months ago

For code:

saspy

output: <module 'saspy' from '/usr/local/anaconda3/lib/python3.9/site-packages/saspy/__init__.py'>

saspy.SAScfg

output: '/usr/local/anaconda3/lib/python3.9/site-packages/saspy/sascfg.py'

The environment is Visual Studio Code running a Jupyter notebook file on a remote server over SSH.

I separately ran the iris dataset (150 rows by 5 columns), and it finished in 0.7 seconds. Could it be a dataframe size/datatype issue?
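One way to probe the size question is to upload progressively wider slices of the dataframe and time each one. The sketch below is an assumption-laden illustration, not something from the thread: `time_uploads_by_width` is a hypothetical helper, `width_test` is a made-up table name, and if one of the widths hangs the loop will not return until the kernel is interrupted.

```python
import time

def time_uploads_by_width(upload, df, widths):
    """Call upload() on the first n columns of df for each n; return seconds per n."""
    timings = {}
    for n in widths:
        subset = df.iloc[:, :n]          # first n columns, all rows
        start = time.perf_counter()
        upload(subset)
        timings[n] = time.perf_counter() - start
    return timings

# Usage in the notebook (assumes the session `sas` and libref 'dat' from above;
# 'width_test' is a hypothetical scratch table name):
#     timings = time_uploads_by_width(
#         lambda part: sas.df2sd(df=part, table='width_test', libref='dat'),
#         ONESCORE_CFN_102023,
#         widths=[5, 50, 200, 700],
#     )
```

If small widths finish quickly and a larger one stalls, that points at the number of columns and not the row count.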

The screenshot below shows the saslog after I interrupted the process:

[Image: Screenshot 2024-03-14 at 12 43 40]

kwangccc commented 3 months ago

I changed the code (removed sas.saslib(libref='dat', path='/sasdata/modeling/_datasamples/equifax/data')) and now it finishes running, but no file is created:

sas.df2sd(ONESCORE_CFN_102023, 'ONESCORE_CFN_102023', '/sasdata/modeling/_datasamples/equifax/data')

SASLOG:

[Image: Screenshot 2024-03-14 at 12 50 19]

tomweber-sas commented 3 months ago

You're saying df2sd() works fine with a different dataframe? Then that's something different. That last one with an invalid libref just got an error in SAS.

I see you have an old version of saspy; 4.7.0. I've been looking through the release notes to see if there's something I may have fixed that could account for this, before just asking to upgrade to the current release. I did find something that looks like it could be this. In your previous post, where I think you interrupted the hung df2sd, it shows it sitting on the send to stdio. I did make a fix in a release between what you're running and current prod version that sounds like it could be this.

[V5.1.2](https://github.com/sassoftware/saspy/releases/tag/v5.1.2)

[5.1.2] - 2023-04-28
Fixed -

Fixed Issue 541 showed a deadlock situation in the STDIO Access Method when the generated code for a
sd2df() call was long enough to block python trying to write that to STDIN because SAS was blocked writing
it out to the LOG, STDERR. So I addressed this so that the deadlock won't happen anymore. This requires no
code changes on your part.
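The deadlock described in that release note is a general property of pipes and is not specific to SAS. The sketch below reproduces the shape of the problem with a stand-in child process (a Python one-liner, not SAS) that echoes its STDIN to STDERR, and shows one standard way to avoid it, `subprocess.Popen.communicate()`. It is an illustration only and makes no claim about how the saspy fix is implemented; `send_without_deadlock` is a hypothetical name.

```python
import subprocess
import sys

# Stand-in for SAS: echoes every line it reads on STDIN to STDERR,
# loosely like submitted code being written back out to the log.
CHILD = (
    "import sys\n"
    "for line in sys.stdin.buffer:\n"
    "    sys.stderr.buffer.write(line)\n"
)

def send_without_deadlock(payload: bytes) -> bytes:
    """Send payload to the child and return everything it wrote to STDERR."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # communicate() feeds STDIN and drains STDERR concurrently.
    # Writing the whole payload to proc.stdin first and only then reading
    # proc.stderr can block forever once both pipe buffers are full:
    # the parent waits to write, the child waits for its STDERR to be read.
    _, err = proc.communicate(payload)
    return err

# About 2 MB of input, far larger than a typical pipe buffer.
payload = (b"x" * 99 + b"\n") * 20000
echoed = send_without_deadlock(payload)
```

A wide dataframe plausibly fits this pattern because more columns mean more generated code to push through STDIN at once, though that link is an inference and not something the release note states.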

So, can you update to the latest version and see if this is fixed? The easiest is:

pip uninstall -y saspy
pip install saspy

Thanks! Tom
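After upgrading, it may be worth confirming that the notebook kernel actually sees a release at or above 5.1.2 (restart the kernel first so the new package is the one imported). A minimal sketch, assuming plain numeric version strings such as 4.7.0; `has_stdio_fix` is a hypothetical helper name and not part of saspy:

```python
from importlib.metadata import PackageNotFoundError, version

FIX_RELEASE = (5, 1, 2)  # release whose notes mention the STDIO deadlock fix

def has_stdio_fix(installed: str) -> bool:
    """True if a 'major.minor.patch' version string is at or above 5.1.2."""
    parts = tuple(int(p) for p in installed.split(".")[:3])
    return parts >= FIX_RELEASE

try:
    installed = version("saspy")
    status = "includes" if has_stdio_fix(installed) else "predates"
    print(f"saspy {installed} {status} the 5.1.2 fix")
except PackageNotFoundError:
    print("saspy is not installed in this environment")
```

Comparing integer tuples avoids the string-comparison trap where "5.10.0" would sort before "5.2.0".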

kwangccc commented 3 months ago

I will try that with our IT team and get back to you. Thanks!

tomweber-sas commented 2 months ago

Hey, just checking on this. Make any progress? I was thinking I can close this and if you need more help later you could reopen it if necessary.

tomweber-sas commented 1 month ago

I'm going to close this, as it's in your court to upgrade to the latest version which has this fix in it. If you need anything else, feel free to reopen it!

Thanks, Tom