vatlab / sos-sas

SoS extension for SAS
BSD 3-Clause "New" or "Revised" License

Issue "getting" data from Python/R into SAS and vice versa #4

Closed jrisi256 closed 4 years ago

jrisi256 commented 4 years ago

Hi,

Thanks for making such a great tool!

I manage an RStudio Server Pro installation for a variety of users on RedHat 8. We have a variety of data scientists who work in R, Python, and SAS. Some people have expressed interest in using this tool in their workflows, so I'm working to get it all set up for them.

I have R versions 3.6.0 to 3.6.3 and Python versions 3.7.6 and 3.6.9. Additionally, I have SAS set-up.

On the SoS side, I have installed the following packages using pip:

pip install sos sos-notebook sos-papermill sos-r sos-sas

I have also installed irkernel and feather for all versions of R, and saspy and sas_kernel for SAS. Each of the languages works fine on its own, and R and Python are playing nice with each other. However, I cannot pass data to SAS (from any version of R or Python), nor can I get data from SAS (into any version of R or Python).

Below is what happens when I try to pass data to SAS:

image

image

Below is what happens when I try to retrieve data from SAS:

image

I tried Googling around, but I couldn't really find anything helpful. Our set-up is pretty specific so if you need more information from me let me know.

BoPeng commented 4 years ago

Thank you very much for working on this. My problem is that I have just started a new job at a new institution (Baylor College of Medicine) and have no access to their computational environment (which hopefully has a SAS server) yet. I will get back to you as soon as I can.

@zhudakai : Is there a SAS server at Baylor?

jrisi256 commented 4 years ago

No worries! Thank you for the timely response. Would there be someone who might be able to help on Gitter? If not, I'll sit tight :)

BoPeng commented 4 years ago

Just make sure that you have read https://vatlab.github.io/sos-docs/running.html#sas carefully. In particular, sos-sas requires SAS 9.4 or higher.

jrisi256 commented 4 years ago

Ok. I looked into all the specifics around our session and environment. Posting them here for your convenience and whenever you have a chance to look at this.

Operating System: Red Hat 8
RStudio Server Pro - 1.2.5019
Jupyter Notebook - 6.0.3
Jupyter Lab - 1.2.6
Jupyter Core - 4.6.1

Python - 3.7.6 (Focal version where SoS, Jupyter, saspy etc. are installed)
saskernel - 2.2.0
saspy - 3.2.0
ipykernel - 5.1.4
SoS - 0.21.5
SoS notebook - 0.21.7
SoS Papermill - 0.1.6
SoS R - 0.19.2
SoS SAS - 0.18.0

Python - 3.6.5 (also set up as its own kernel)
ipykernel - 4.8.2

R 3.6.3
R 3.6.2
R 3.6.1
R 3.6.0
irkernel - 1.1 (installed for each version of R, each has its own kernel set up)

All the above software is on a separate machine from our SAS install. I have configured sascfg.py in saspy correctly so the SAS kernel is working within Jupyter. Below is the information concerning SAS and the OS of that machine.

SAS - 9.4 - Maintenance 5
Operating System - Red Hat 7

zhudakai commented 4 years ago

Sorry guys, I've been preoccupied with my company emails. Yes, we are running a SAS server of some sort. We support SAS Studio as well as SAS Enterprise Guide. Both require some back-end metaserver. We even activated LDAP for authentication. Looking forward to helping in any way.

BoPeng commented 4 years ago

Reading more closely on your ticket now while the SAS server is being configured.

I have no idea about the put-data-to-SAS part yet, but to get data out we do something very simple: we create a temporary directory and run the following in SAS

https://github.com/vatlab/sos-sas/blob/181810a59e50ceba17fcf7d27149ec426fae5b94/src/sos_sas/kernel.py#L83-L87

libname TEMP 'path_to_temp';
Data TEMP.dat_1;
   set CLASS;
run;

and we assume that a file named dat_1.sas7bdat would be created under path_to_temp. With your version of SAS, what file would be created with this statement?
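For illustration, the export step described above could be sketched in Python roughly as follows. This is not the actual sos-sas code (the linked kernel.py is the reference); `build_export_statement` and its parameters are hypothetical names, and the lowercase file name assumes a Unix SAS host.

```python
import os
import tempfile


def build_export_statement(temp_dir, source="CLASS", target="dat_1"):
    """Return the SAS code that copies `source` into `temp_dir`, plus the
    path of the file SAS is expected to create (hypothetical helper)."""
    sas_code = (
        f"libname TEMP '{temp_dir}';\n"
        f"data TEMP.{target};\n"
        f"   set {source};\n"
        "run;\n"
    )
    expected_file = os.path.join(temp_dir, f"{target.lower()}.sas7bdat")
    return sas_code, expected_file


temp_dir = tempfile.mkdtemp()
sas_code, expected_file = build_export_statement(temp_dir)
print(sas_code)
# This check can only succeed if SAS writes to a directory that the
# Jupyter host can also see.
print("file visible locally:", os.path.exists(expected_file))
```

The last line is the part that matters for this issue: the code sent to SAS can be perfectly valid while the file still never shows up on the Jupyter side.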

BoPeng commented 4 years ago

@jrisi256 There is also a possibility that the Jupyter server and SAS server do not share the same /tmp directory... If that is the case, we have to find a common directory so that the Jupyter server can read the /tmp/... sas7bdat file written by the SAS server.

zhudakai commented 4 years ago

dat_1.sas7bdat. We're running SAS 9.4 (TS1M6).

BoPeng commented 4 years ago

Since your SAS 9.4 uses the same file format, @jrisi256 must have the SAS server and the Jupyter server on two different file systems. Let me check if there is a good solution to that... In many cases these servers share the same $HOME etc., but not /tmp.

jrisi256 commented 4 years ago

That is correct! The SAS server and the Jupyter server are on two different machines. Currently they don't share any directories or anything like that. We could set it up so that there is a mounted directory that both of them can access. How would we then configure SoS to use this new shared directory (rather than /tmp)?

BoPeng commented 4 years ago

As far as I remember, saspy did not support this (SAS and Jupyter on different file systems) when we developed sos-sas. The SoS -> SAS problem was caused by the use of newer versions of saspy, and the SAS -> SoS problem was caused by the separation of file systems. I have fixed the former and am working on the latter... It should be ready by the end of this week.

zhudakai commented 4 years ago

Are you guys talking about something like NFS mount or Samba share? If a file system is a shared resource, any host can mount a slice, just make sure SAS and Jupyter are mounting the same slice - sorry if I'm off target here

BoPeng commented 4 years ago

Yes and no. Yes, we could use some directory that is shared between the two services, but in general, if saspy allows communication across file systems, sos-sas should do so as well. There are some technical problems, in particular I am not sure if the Jupyter IOPub channel can be used to pass large amounts of data, but I should be able to come up with something that @jrisi256 can test.

zhudakai commented 4 years ago

Bo, I’m really interested in all of this. A novice question: can OpenMPI be used as an API? I know building a system around file systems is silly. Thanks. Dakai


BoPeng commented 4 years ago

Short answer is no.

To ensure maximum compatibility, sos-notebook takes a wrapper approach to existing kernels. It can "inject" commands to kernels and "tap" output from kernels, but does not change subkernels in any way, or know anything about the kernel.

In the case of sos-sas, SoS Notebook currently sends a statement to the SAS kernel to save the dataset to a file, picks up the output from the file system, and loads the dataframe into SoS, effectively "transferring" a dataframe from SAS to SoS. However, when SoS and SAS do not share a file system, SoS fails to pick up the file, which is the bug reported here.

There are two solutions here. I would really like to use the SAS kernel to dump the data, but to do that I would have to let SAS "print" the dataset so that SoS can capture the output and create a dataframe. The problem right now is that the SAS kernel prints everything in HTML. This means that if I want to output a large dataset, every value will be wrapped in things like <span>123</span>, which is a HUGE waste of bandwidth, and I do not know if it will work for reasonably large datasets.
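A rough way to see the size of that overhead is the small Python sketch below. It only mimics the `<span>123</span>` wrapping mentioned above; the markup the SAS kernel really emits is an assumption here and is likely heavier.

```python
# Three sample values, compared as plain text and as span-wrapped HTML.
values = [123, 45.6, 7]

plain = " ".join(str(v) for v in values)
wrapped = "".join(f"<span>{v}</span>" for v in values)

print(len(plain), len(wrapped))
print(f"overhead factor: {len(wrapped) / len(plain):.1f}x")
```

Short values suffer the most, because the fixed cost of the tags dwarfs the payload.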

Another option is to grab the dumped files directly from the SAS server, which could work if the connection is through SSH, but will not work if the connection is through a Java library (the IOM method).

I am debating which method I should use. Perhaps I can implement the second method for Linux-based servers, and then add the first method later for Windows-based servers. It would be slower but windows users can already tolerate windoze....

zhudakai commented 4 years ago

SAS can output in clear text format, such as a .lst file, but it's still formatted. It has a lot fewer tags and is easier to clean up. It can also dump into PDF format, supposing you have an API for that? If you insist, SAS has a driver to write out .csv. There are a few other formats, and not all of them are file system dependent. I personally would use a put/file combo so I can get a data file for the next stop. What if you write to stdout?

From this thread: https://communities.sas.com/t5/SAS-Programming/How-to-redirect-put-statements-to-the-result-instead-of-the-log/td-p/513618

sas -stdio <(echo 'data _null_; file stdout; put "hello, foobar"; run;') 2>/dev/null

BoPeng commented 4 years ago

No, writing the file is not the problem here. The Python end can read sas7bdat, csv, or whatever SAS saves, but it needs to access the file, which in this case is on the remote host.

After playing around a bit, I found that if I run

%put CLASS2
DATA CLASS2;
   INPUT NAME $ 1-8 SEX $ 10 AGE 12-13 HEIGHT 15-16 WEIGHT 18-22;
CARDS;
JOHN.    M 12 59 99.5
PETER    F 15 59 9.5
PROC PRINT;
RUN;

in SAS, a file called class2.sas7bdat appears in a directory under the work directory on the SAS server, which can be located with

%put %sysfunc(getoption(work));

I can retrieve the class2.sas7bdat file with scp now, but could you let me know if this is always the case? That is to say, will every dataset in SAS be under this work dir?
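The retrieval step could look roughly like the Python sketch below. It is only an illustration under stated assumptions: `scp_command` is a hypothetical helper, the host name and WORK path are made up, and the lowercase file name assumes a Unix SAS server.

```python
import posixpath
import shlex


def scp_command(host, work_dir, dataset, local_dir="."):
    """Build an scp command that copies a WORK dataset from the SAS host.

    Assumes a Unix SAS server, where dataset files use lowercase names.
    """
    remote_file = posixpath.join(work_dir, f"{dataset.lower()}.sas7bdat")
    return ["scp", f"{host}:{remote_file}", local_dir]


# Hypothetical host name and WORK path, for illustration only.
cmd = scp_command("sas.example.org", "/tmp/SAS_work1234", "CLASS2")
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` once SSH access to the SAS host is in place.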

I think an obvious exception is for user to do

libname NAME 'path'

but how does SAS create a dataset in NAME? Sorry, I have not used SAS in the past 15 years and have forgotten all about it.

BoPeng commented 4 years ago

With the latest patch, a remote SAS server works if it is Linux-based and connected through SSH.

image

image

But more tests are certainly needed.

@jrisi256 You are welcome to build sos-sas from GitHub and test if it works in your environment, but only if you are comfortable with building python packages from source and testing unstable source code. I will test the module in the next few days and will release the next version of sos-sas afterwards.

zhudakai commented 4 years ago

libname defines where a permanent dataset should be stored. Otherwise, all temporary datasets should be destroyed when SAS ends, but if one kills SAS forcibly there is no chance for cleanup to occur, which is a sysadmin's headache. Anyway, the temporary (working) directory can be defined post installation in the file sasv9.cfg. This can be customized with sasv9_local.cfg. A user can also define it in his home directory to supersede my system settings. I frequently change the default from /tmp to /scratch/SAS, given that /scratch always seems to be much bigger. Other things I change are memsize and swapsize, which I have to make especially large.


zhudakai commented 4 years ago

Regarding SAS working directory, I wonder if there’s an environment variable you can tap into? We can take a look in our RedHat 7 host


zhudakai commented 4 years ago

Once in SAS, variable WORK is the location of temporary files of all kinds. Hence

data one;

Is the same as

data WORK.one;

WORK must be a reserved libname then


BoPeng commented 4 years ago

So if I understand you correctly,

data WHATEVER

is the same as

data WORK.WHATEVER

which is under a directory that can be retrieved (however the system is set up) with the following command.

%put %sysfunc(getoption(work));

This is the case that I have already handled.

Now, is there any command to get the path to mylib

Data mylib.data

where mylib is defined somewhere with

libname mylib path

What I want to do is, when user tries to do

%get mylib.data --from SAS

from a SoS kernel, sos-sas should try to get the path of mylib, and then scp path_to_mylib/data.sas7bdat to the local file system to read into SoS.

zhudakai commented 4 years ago

Defining and using a libname results in permanent datasets being read from and written to a custom location; the key word is 'permanent'. WORK, on the other hand, should be cleared if SAS can exit normally. In our case, if you go to /scratch/sastemp/, you'd see the SAS garbage people left behind when their jobs were terminated abnormally. For your purpose, tapping into the WORK option inside SAS is the best bet. If a libname is defined and you can tap into it, then that should cover more datasets. Do you give people an option here?

BoPeng commented 4 years ago

Yes and no again because I do not know SAS well.

For example, if you have a dataset under a customized libname,

libname mylib "my path"
Data mylib.DATA ...

Can DATA be accessed without mylib?

Basically I am trying to decide if SoS should be more clever and do

%get DATA --from SAS

or users have to prefix with libname

%get mylib.DATA --from SAS

Note that in case of WORK, we ignored WORK and used dataset name directly.
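To make the naming question concrete, here is a small Python sketch of how the argument to %get could be split into a libref and a dataset name, with bare names falling back to WORK. `split_dataset_name` is a hypothetical helper for illustration, not something taken from sos-sas.

```python
def split_dataset_name(name, default_libref="WORK"):
    """Split 'mylib.DATA' into ('mylib', 'DATA'); bare names map to WORK."""
    libref, sep, dataset = name.partition(".")
    if not sep:
        # No prefix given, so treat the name as a temporary WORK dataset.
        return default_libref, name
    return libref, dataset


print(split_dataset_name("mylib.DATA"))
print(split_dataset_name("DATA"))
```

Under this scheme a bare name never searches other libraries, so users would have to prefix anything that is not in WORK.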

finkbine commented 4 years ago

I have never used SoS, but it looks very flexible in sharing objects among platforms. I agree with Dakai that it is necessary to assign a libname. My basic idea is that an object created in SAS, R, or Python should be tracked with its full address. The WORK (SAS) scenario may be very hard to track, since it depends on the initial assignment made when SAS was installed.

finkbine commented 4 years ago

The sample code above seems to work on the current working space. Another question is about the sos-sas engine: if it does not depend on the current working environment, how does it get its input and generate the SAS dataset?

BoPeng commented 4 years ago

@finkbine Let us do this one by one. If you create a dataset under another directory with libname mylib "path", can it be accessed without the libname?

finkbine commented 4 years ago

In SAS, a prefix (libname) must be used to reference a dataset. Basically, libname mylib "path" means that path already exists, and if you want to use datasets within path, you have to write mylib.***

BoPeng commented 4 years ago

and the only exception is "work"...

Second question: say you have created a mylib, what is the command for me to get the directory of mylib?

finkbine commented 4 years ago

In SAS, as far as I know, a libname is created by the user, so the user should know what he is doing. In a SAS session, any number of libnames can be defined.

finkbine commented 4 years ago

In SoS, is there any mechanism to track what users type in SAS code, for example by searching for the keyword libname?

BoPeng commented 4 years ago

No. I am using

%put %sysfunc(getoption(work));

to get the directory of WORK. I actually saw something online that says

%put %sysfunc(pathname(mylib));

can be used to get path of mylib. Could you check if this is true?

finkbine commented 4 years ago

I see. I have never used this function since I rarely need it. Yes, this command gives the absolute path for a libname, which can be used to retrieve datasets.
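Putting the two statements from this exchange together, the directory lookup could be sketched in Python as below. `path_query` is a hypothetical helper name; only the two SAS macro statements come from the discussion above.

```python
def path_query(libref):
    """Return the SAS macro statement that prints a library's directory."""
    if libref.upper() == "WORK":
        # WORK is resolved through the system option, as discussed above.
        return "%put %sysfunc(getoption(work));"
    return f"%put %sysfunc(pathname({libref}));"


print(path_query("WORK"))
print(path_query("mylib"))
```

The caller would still need to pick the printed path out of the SAS log, which this sketch does not attempt.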

BoPeng commented 4 years ago

Great. Then could you please adapt the "CLASS" example to an example in a customized libname so that I can test it? The workflow would be something like

libname mylib path
create CLASS inside mylib

Then from SoS

%get mylib.CLASS --from SAS

should grab mylib.CLASS and create a dataframe. I have not decided what name the dataframe should be called. It could be just CLASS, or a namespace with CLASS inside so the dataframe would be called mylib.CLASS.

finkbine commented 4 years ago

Did you implement these engines with default Python, R, and SAS functions? Sometimes format problems could be a concern.

finkbine commented 4 years ago

Maybe Dakai can do it? Currently, I do not know how to use it.

BoPeng commented 4 years ago

Thanks @finkbine. That is OK, I think I can figure it out.

jrisi256 commented 4 years ago

Hi @BoPeng thanks so much for all the hard work you're doing in trying to resolve this issue. I am an admin for a group of users so as a policy I don't download/install unstable development releases. Since this isn't an urgent issue for us, I am going to hold off until you officially release the fix.

In the meantime, if you need me to test something or if there is anything else I can do to help, let me know. Sadly I don't know SAS at all either, haha. I just help set this up for people to use.

BoPeng commented 4 years ago

@finkbine I have just released sos-sas version 0.19.0. It has been tested in our SAS environment (jupyterlab/mac + SAS/Linux) and should hopefully also work in yours. Please check the updated sos-sas documentation for details and let me know if you encounter any more bugs.

Note that remote %get from Windows SAS servers will not work, but I will leave that for later.