sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy

sasdata mode DISK/CSV tempkeep does not seem to work #599

Closed: rainermensing closed this issue 2 months ago

rainermensing commented 2 months ago

I am using saspy==5.1.2 and python 3.10 on Windows.

I am trying to load a file and keep the tempfile like so:

import saspy
sas = saspy.SASsession()  # connection configuration omitted here
sasdf = sas.sasdata("table", "libref")
df = sasdf.to_df_CSV("C:/tmp/test", tempkeep=True)

However, C:/tmp/test is not there after the load is finished. I also tried it using DISK, same issue.

Thanks a lot for your help!

rainermensing commented 2 months ago

FYI, my underlying goal is to load data from SAS and store it in a local file. Since my source tables are large and do not fit into memory, I am currently paginating through the table and writing the data to a parquet file page by page.

Is it possible to stream the data directly to a csv file or similar without loading the full table into memory using saspy? ODBC/JDBC is not possible in my context, unfortunately.

tomweber-sas commented 2 months ago

Are you connected to a local Windows install of SAS, or remote to a Workspace server?

rainermensing commented 2 months ago

A remote workspace server

tomweber-sas commented 2 months ago

Ah, yes, the tempfile is deprecated except for a local Windows configuration. The CSV method was the slowest and not the best, as it used proc export and had to write the file to disk before importing it into pandas. I reworked that years ago to stream directly into pandas for all of those variants, so the _CSV and _DISK versions aren't needed anymore, but they are still there. The base to_df() (sd2df() ...) is the fastest and most capable version. I only left the tempfile in there for a local Windows connection, as it could be quicker than streaming the data between processes (SAS writes the file to disk, and Python reads it in). But you can't do that remotely.

Parameters
tempfile – [deprecated except for Local IOM] [optional] an OS path for a file to use for the local CSV file; default is a temporary file that’s cleaned up

tempkeep – [deprecated except for Local IOM] if you specify your own file to use with tempfile=, this controls whether it’s cleaned up after using it
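For reference, a minimal sketch of the plain to_df() path described above (the cfgname, table, and libref are placeholders, not from this issue):

import saspy

# Connect to the remote workspace server; 'remote_iom' is a hypothetical
# entry in sascfg_personal.py.
sas = saspy.SASsession(cfgname='remote_iom')

# to_df() streams the data set over the connection straight into a pandas
# DataFrame; no intermediate CSV/tempfile is written on the remote side.
sasdf = sas.sasdata('table', 'libref')
df = sasdf.to_df()
print(df.shape)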

If I understand your use case, you could use to_csv() from your SASData object to write the data to a .csv file on the server, then download('local.csv', 'server.csv') to pull it down, and then process the full csv file from Python as you like. Is that what you're looking for?
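A sketch of that to_csv() + download() route, assuming an existing SASsession named sas; the server-side and local paths are placeholders:

# Write the data set to a CSV file on the SAS server's filesystem.
sasdf = sas.sasdata('table', 'libref')
sasdf.to_csv('/tmp/table.csv')

# Pull the server-side file down to the local machine
# (download takes the local path first, then the remote path).
sas.download('C:/tmp/table.csv', '/tmp/table.csv')

# The local CSV can then be read in chunks, so it never has to fit
# into memory all at once.
import pandas as pd
for chunk in pd.read_csv('C:/tmp/table.csv', chunksize=100_000):
    pass  # convert/append each chunk as needed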

tomweber-sas commented 2 months ago

Using the regular to_df() is the best, better than using CSV. So what you're doing, if I understand, is using df = to_df(firstobs, obs) to iterate through the data set, and then writing out the df (appending it if that works) to a parquet file. That would work too, and probably not be much slower; you'd have to try both ways to see, though.
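A sketch of that chunked approach combined with a pyarrow ParquetWriter, assuming an existing SASsession named sas; the table, libref, chunk size, and output path are placeholders, and the firstobs/obs data set options are passed via dsopts:

import pyarrow as pa
import pyarrow.parquet as pq

sasdf = sas.sasdata('table', 'libref')
nrows = sasdf.obs()            # total number of observations in the table
chunksize = 100_000
out_path = 'C:/tmp/table.parquet'

writer = None
schema = None
for start in range(1, nrows + 1, chunksize):
    end = min(start + chunksize - 1, nrows)
    # firstobs/obs are SAS data set options; obs is the absolute ending row,
    # so each pass reads rows start .. end.
    df = sas.sd2df('table', 'libref',
                   dsopts={'firstobs': start, 'obs': end})
    tbl = pa.Table.from_pandas(df, preserve_index=False)
    if writer is None:
        schema = tbl.schema
        writer = pq.ParquetWriter(out_path, schema)
    # Cast each chunk to the first chunk's schema, in case pandas infers a
    # different dtype for a chunk with missing values.
    writer.write_table(tbl.cast(schema))

if writer is not None:
    writer.close()

Each chunk becomes one row group in the parquet file, so memory use is bounded by the chunk size rather than the table size.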

rainermensing commented 2 months ago

Thank you. OK then, sorry, I thought that "[deprecated except for Local IOM]" meant it is not recommended but still an option.

My ideal scenario would be streaming directly into a parquet writer like pyarrow. The option you described seems feasible, but would likely cost a lot of time? Also, going via csv would likely cause issues with the serialization/conversion of data types, etc. But thank you again for your quick response.

tomweber-sas commented 2 months ago

I only have data transfer between Pandas and SAS tables currently, no direct to arrow or other formats unfortunately.

rainermensing commented 2 months ago

Alright, if I have time I will take a deeper look at your code and see if I can put together a POC. Thanks again for your time!