pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add SAS database read support (.sas7bdat) as pl.read_sas(filepath:str, **kwargs) #14587

Open louisbrulenaudet opened 8 months ago

louisbrulenaudet commented 8 months ago

Description

Dear developers,

Since SAS is a proprietary language used at scale, it would be beneficial to support reading SAS data files (.sas7bdat) directly, so that users do not have to rely on third-party libraries and a time-consuming, sub-optimal series of conversions.

Today, it is possible to use Dask to parallelize reading with pyreadstat, but the Dask DataFrame must then be converted to pandas, and the pandas DataFrame in turn to Polars. The conversion from Dask to pandas is relatively slow and cumbersome in a production environment.

Two solutions can be envisaged: either Dask support within Polars, or SAS support to guarantee Polars' autonomous operation. Also, integrating progress bar support would be very useful, especially since .sas7bdat files are generally used for tables containing more than 1000 columns.

Best regards, Louis

alexander-beedie commented 8 months ago

Two solutions can be envisaged: either Dask support within Polars, or SAS support to guarantee Polars' autonomous operation

There's another solution: Arrow export from the existing SAS libraries. With that in place we could simply zero-copy the output into Polars without having to write an entire (complicated) SAS-parsing I/O stack (which I suspect there is little appetite for). It could be worth opening an issue on the various projects, requesting efficient Arrow export 😉 Otherwise some intermediate conversions are likely the way to go for now...

Out of curiosity, what are the major domains that use these files? I've never come across them in finance; are they somewhat domain-specific?

krz commented 5 months ago

SAS files are integral to the health sector, especially when dealing with health authorities and regulators. SAS facilitates regulatory compliance, so it is a common choice among health professionals. Polars support would be very much appreciated.