ianepreston / stats_can

Get Statistics Canada data into python (mostly pandas)
GNU General Public License v3.0

Add support for Polars dataframes #461

Open DeflateAwning opened 1 month ago

DeflateAwning commented 1 month ago

Polars dataframes are typically much faster and more memory-efficient than pandas dataframes, and they have a more ergonomic interface for transformations.

At some point, you may want to switch the backend to Polars. At least for now, I think it makes sense to make a function that returns the result as a Polars dataframe without first going through Pandas (assuming the dataframe's current construction technique allows for it).

Fantastic library though! Very excited to check it out further.

ianepreston commented 1 month ago

@DeflateAwning, I've been thinking about a more significant refactor of this project that would make the dataframe layer an optional extension, with the core package relying only on querying the REST API and downloading files. I'm not a Polars user, but in my professional life I would benefit from this library reading directly from CSV into a Spark dataframe, and the changes that would allow that would permit a Polars extension as well. The first step is adding deprecation warnings to the existing parts of the code base that depend on pandas, and adding some pandas-specific functions so that users of the existing setup can transition. I'm not sure when I'll have time to do all that, and I'll need to allow some adoption time to pass before I rip out features, so this will not be a quick change, but I support the direction.

DeflateAwning commented 1 month ago

Awesome, exciting news with all that! Looking forward to seeing the direction this all goes!

Depending on what the API responses look like (e.g., if they're table partitions or similar), Polars would be a great choice for holding the intermediate data, and it's a great tool for converting to on-disk CSV, Parquet, or other formats for storage. It's far more lightweight than Spark and more performant than pandas. It also supports efficient conversion to each of those dataframe types, which means solid interop with pandas and Spark.
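As a small illustration of the storage side, the sketch below reads raw CSV bytes into Polars and writes them back out as Parquet and CSV. The function name `export_table`, the output file names, and the assumption that the downloaded table arrives as CSV bytes are all made up for the example. Only `pl.read_csv`, `DataFrame.write_parquet`, and `DataFrame.write_csv` are real Polars calls.

```python
import io
from pathlib import Path


def export_table(raw: bytes, out_dir) -> "pl.DataFrame":
    """Read raw CSV bytes with Polars and store Parquet and CSV copies in out_dir."""
    import polars as pl  # imported here so Polars stays an optional dependency

    df = pl.read_csv(io.BytesIO(raw))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    df.write_parquet(out / "table.parquet")  # hypothetical file name
    df.write_csv(out / "table.csv")          # hypothetical file name
    return df
```

From there, `df.to_pandas()` hands the same data to pandas (it needs pandas and pyarrow installed). Getting it into Spark would usually go through that pandas frame or the Parquet file.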