sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy
Other
366 stars 149 forks source link

UPDATE: preserve sas timestamps & pass through pyarrow kwargs #607

Closed rainermensing closed 1 week ago

rainermensing commented 1 week ago

Hello @tomweber-sas , I think we closed the testing of our sasdata2parquet staging method (insofar as we have merged this version into our productive workflow, so it should better work as expected haha).

Preserve original sas timestamps

The main visible change in this version is that the timestamps in the final parquet file will be the timestamps that can be found in sas. This also means, however, that loading the parquet with pandas will lead to OutOfBounds timestamp issue when very large (positive or negative) timestamps are contained in it. In this case, the user will need to develop a custom "safely_load_pandas" method that parses the affected columns. But this is still better than just coercing the timestamp to be null by default. Anyways, the raison d'être of the sasdata2parquet method are scenarios where the Data would not fit into memory and will instead be processed by distributed frameworks like spark.

Pass through pyarrow kwargs

I further made changes to the parameters. I found wiring individual parameters from the method head to the pyarrow.Table.read_pandas() method or the ParquetWriter class quite constraining for the user. Instead, I give advanced users the ability to pass all possible parameters to these methods using a kwargs dictionary (pa_pandas_kwargs, pa_parquet_kwargs). This accounts for any edge cases that might not be covered in our own workflow and that future users might encounter. We, for example, use it to pass our own pyarrow schema:

sas.sasdata2parquet("test.parquet",table=table,libref=libref,
                             pa_parquet_kwargs = {"schema":pa_schema, "compression": 'snappy',"flavor": "spark","write_statistics": False })

There are also some other bug fixes regarding null columns, etc., but the above two are the main "features" relevant to the user.

IMPORTANT: Please note that I have developed this method for a client, and my contract will close by the end of this month. This also means that I won't be able to do any more testing (and hence development), since I do not have my own SAS server. I have passed this on to another colleague who will follow your progress on this issue and update the saspy once it is released. If you want me to respond to anythink tht might come out of your tests, it will be able to do so until Wednesday June 26th. Sorry for the short notice.

rainermensing commented 1 week ago

600

tomweber-sas commented 1 week ago

Hey @rainermensing I can try to see about integrating this. I'm a little confused as to why a bunch of changes that are already at main are pulled into this. I would expect only changes to what is actually in this branch be included here. There are changed I've made in this branch, when trying to integrate your prototype into all three access methods, that are undone in this PR also; your changes in this PR were not based upon the code here, but still on your original prototype from a previous version of main. I can try to unwinds all of that, but it will take some time, to try to reintegrate what is new here without missing anything I already did previously and included into the other access methods. I don't know if it would be easier for you to start with what's in this branch and just add the new parts or for me to try to unwind it and integrate what's new and push that all to the branch. I'll have to do that in the end anyway, to add this to the other AM's, but if there were only differences between what in this branch and what's new, that would be easier.

rainermensing commented 1 week ago

@tomweber-sas I am sorry I had merge your main branch into my parquet branch. I undid that now an recommitted my changes to the old version. It should be ok now.

tomweber-sas commented 1 week ago

Thanks! this is much cleaner to merge in. I'll still need to pull this and rework some of it to integrate into the other 2 access methods. I'll try to get to that today!