sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy

Polars data frame support #610

Open ifly6 opened 1 month ago

ifly6 commented 1 month ago

Is your feature request related to a problem? Please describe.
Export SAS data sets to Polars, allowing for data type re-inference.

Describe the solution you'd like
Split the existing SASIOSTDIO method that writes CSVs into two parts: one that creates the in-memory data frame, and two separate handlers for construction.

At line 2411 of that file, return sockout, and then allow the file handle to be passed around as well.

This splits the in-memory CSV from the library-read implementation. It may also need a context manager around this socket object so it gets closed cleanly.

Describe alternatives you've considered
I'm currently exporting to Pandas, then converting to Polars by constructing a massive CSV in memory (so I can re-infer data types). This is unnecessarily inefficient.

Additional context
Allow Pandas or Polars to re-infer data types; SAS data sets are not always set up to handle them properly. On line 2411, allow the dtype parameter to be omitted.

This can also be set up for custom usage if the carefully constructed sockout handle is exposed.

tomweber-sas commented 1 month ago

Interesting. I have a number of thoughts about this, in different directions 😊 So, in no particular order:

1) It would be good to be able to transfer to more than only Pandas. There are a lot of newer packages these days, like Arrow, Parquet, and this Polars I'd never heard of till now. SAS doesn't have corresponding engines I could use to get the data out the way these packages want it. So I have to do a bunch of work to stream the data in the correct format so the Python side can parse it correctly, including calling that code exactly right (all of its options) so it matches the data format.

2) CSV: I don't really like the CSV versions of my data transfer routines, since they use proc import and proc export to read or output the data, which gives me no control over the format. Those two procs don't work as well for this as my default versions of the methods where, as I said above, I format the data exactly as it needs to be, based on how I'm getting the Python side to read and parse it, so the final data's correct.

3) A user just contributed a new method (which I integrated into all 3 access methods), sasdata2parquet (sd2pq), which transfers a data set over and writes it out to a Parquet file (or partitioned files), so you can get data that's too big to keep in Pandas and use other interfaces to process it; Arrow, for instance. This Polars says it supports the Parquet format and uses Arrow behind the scenes (or supports it). I wonder if using that will address the issue you have: sas.sd2pq('outputfile', 'table', 'libref'), then df = polars.read_parquet('outputfile') or whatever it has to import that. That would be good to try, to see if it already solves this.

4) dtype: you mention wanting the Python side to infer the dtypes. This is already supported. You can specify dtype= on the sd2xx() methods. If you were to specify dtype=str, you could manipulate the types after the fact, because all the columns would just be strings, but formatted correctly. Or you can provide formats (my_fmts=) for columns and specific dtypes you want to use, so you can control the formatting and parsing to get the types you want.

5) The trouble with supporting these various Python packages comes down to matching the exact format of the data stream with the options and functionality of the Python packages' import routines. It took a long time to get it all matched up exactly with Pandas, especially with missing values and embedded quotes and newlines. When I looked at the Parquet import functions, I couldn't get their options to align with the data format I already create for Pandas, so that would have taken a lot of time to reengineer. And the format and import options aren't exactly the same in each access method, due to SAS differences in how each of those works; just another complication. This is also why I don't like the proc import/export versions of this so much.

6) Due to the previous bullet, handing the socket back to the user and hoping they have everything exactly right with wherever they are sending it to, and that they close it right, makes me not really inclined to try that. But it is an interesting thought; I'm still going to think on that some.
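The dtype=str approach in bullet 4 can be sketched like this (the `sas.sd2df` call is stood in for by a hand-built frame, and the column names are made up):

```python
import pandas as pd

# Stand-in for df = sas.sd2df('table', 'libref', dtype=str): with dtype=str,
# every column arrives as correctly formatted strings.
df = pd.DataFrame({"qty": ["10", "20"], "price": ["1.25", "2.50"]}, dtype=str)

# Manipulate the types after the fact, column by column, once you know
# what each one should be.
df = df.astype({"qty": "int64", "price": "float64"})
```

Because the strings are formatted exactly as intended on the SAS side, the post-hoc cast is lossless; the cost is one extra pass over the data.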

So, can you try the sd2pq() method, read in the parquet file(s) using Polars, and see how that works for your use case? The point of it seems similar to what you're trying to accomplish; it's an extra step, of course, but it's done for you. That method supports dtype= and my_formats= as a pyarrow schema, so with those you ought to be able to get whatever ending types you want. I haven't messed with sd2pq and all of its options that much, other than verifying the default cases all work. The user who contributed it had real cases they verified, of course, since they were running production jobs with this.

And, reply to any of the bullet items too.

Thanks, Tom

tomweber-sas commented 3 weeks ago

Did you ever try the sd2pq() method to see if it would work for your use case?