matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License
346 stars, 25 forks

Support polars #160

Open deanm0000 opened 1 year ago

deanm0000 commented 1 year ago

Polars is a (relatively) new dataframe library that is gaining popularity and blows pandas away in performance, using Arrow memory in the backend.

matthewwardrop commented 1 year ago

Hi @deanm0000 !

Thanks for the suggestion. There's a trivial way to implement support (converting to pandas, which is how we currently support pyarrow Tables), or a more complicated integration that fully adds support for polars arrays everywhere, perhaps via just using Arrow arrays. The framework itself doesn't care about the datatypes, but some of the transforms do... and that will be the bulk of the work.

Of course, if the goal is the performance benefits, converting everything to pandas defeats the purpose.

Do you have any instances where you are bottlenecked on performance? Or is this more of a quality-of-life feature request?
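
For reference, the trivial "convert to pandas first" route described in this comment can be sketched roughly as below. This is only an illustration under stated assumptions: `coerce_to_pandas` and `FakeFrame` are hypothetical names, not part of formulaic, and the stand-in class returns a plain dict so the sketch runs without third-party packages. Both polars DataFrames and pyarrow Tables do expose a `to_pandas()` method, which is what the duck-typing check relies on.

```python
def coerce_to_pandas(data):
    """Convert `data` via its `to_pandas()` method when it has one.

    Hypothetical helper, not part of formulaic. Objects without a
    callable `to_pandas` attribute are returned unchanged.
    """
    to_pandas = getattr(data, "to_pandas", None)
    if callable(to_pandas):
        return to_pandas()
    return data


class FakeFrame:
    """Stand-in for a polars DataFrame or pyarrow Table (illustration only)."""

    def __init__(self, columns):
        self.columns = columns

    def to_pandas(self):
        # A real library would build a pandas.DataFrame here; a dict keeps
        # this sketch free of third-party dependencies.
        return dict(self.columns)


# An object with `to_pandas()` is converted; anything else passes through.
converted = coerce_to_pandas(FakeFrame({"x": [1, 2, 3]}))
passthrough = coerce_to_pandas({"x": [1, 2, 3]})
```

With real libraries installed, the same idea would amount to passing `coerce_to_pandas(polars_df)` to `formulaic.model_matrix`, at the cost of the conversion discussed above.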

deanm0000 commented 1 year ago

I guess, in those terms, it's a quality-of-life improvement. From a pure usability perspective it isn't hard to convert to pandas. I didn't realize that the pyarrow input was just converted to pandas under the hood. I poked around really quickly and couldn't find where in the code the transformations happen. Could you point me to that, for example if I did Y~X+I(X^2)?

matthewwardrop commented 1 year ago

The lazy arrow -> pandas conversion happens here: https://github.com/matthewwardrop/formulaic/blob/main/formulaic/materializers/arrow.py . In practice, under the hood, the data can sometimes pass through this conversion uncopied, but the compute is then done on numpy arrays or pandas Series, depending on the transform. Again, the framework is datatype agnostic, so it is happy with other types... but we'd need to go through and update the transforms (like contrast encodings) to make sure they have implementations for these types.
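
To make the "framework is datatype agnostic, transforms are not" point concrete, here is a minimal sketch of one way a transform could carry per-type implementations, using `functools.singledispatch` from the standard library. It is an assumption for illustration only and not how formulaic actually organises its transforms; `center` is a hypothetical transform, and `list` and `array.array` stand in for types such as pandas Series or polars Series.

```python
from array import array
from functools import singledispatch
from statistics import fmean


@singledispatch
def center(values):
    """Subtract the mean from `values`; fallback for unsupported types."""
    raise TypeError(f"center() has no implementation for {type(values).__name__}")


@center.register
def _(values: list):
    # Implementation for plain Python lists.
    mean = fmean(values)
    return [v - mean for v in values]


@center.register
def _(values: array):
    # Implementation for array.array; the result keeps the input typecode.
    mean = fmean(values)
    return array(values.typecode, (v - mean for v in values))
```

Supporting a new container type then means registering one more implementation per transform, which is roughly the "bulk of the work" described earlier in the thread.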

glemaitre commented 1 year ago

Maybe one thing to consider here is the effort to come up with a DataFrame API: https://data-apis.org/dataframe-api/draft/

It could be handy for writing DataFrame-agnostic code.

MarcoGorelli commented 4 months ago

Hi @matthewwardrop - would you be open to using Narwhals for this? Altair recently adopted it for this purpose (https://github.com/vega/altair/pull/3452), as did scikit-lego.

Happy to put up a POC if you'd be interested (just checking first!)