Open deanm0000 opened 1 year ago
Hi @deanm0000 !
Thanks for the suggestion. There's a trivial way to implement support, which is how we currently support pyarrow Tables (by converting to pandas), or a more complicated integration that adds full support for polars arrays everywhere, perhaps just by using Arrow arrays. The framework itself doesn't care about the datatypes, but some of the transforms do, and that will be the bulk of the work.
Of course, if the goal is the performance benefit, converting everything to pandas defeats the purpose.
Do you have any cases where performance is actually the bottleneck? Or is this more of a quality-of-life feature request?
I guess, in those terms, it's a quality-of-life improvement. From a pure usability perspective, it isn't hard to convert to pandas. I didn't realize that the pyarrow input was just converted to pandas under the hood. I poked around quickly and couldn't find where in the code the transformations happen. Could you point me to that, for example for a formula like Y~X+I(X^2)?
The lazy arrow -> pandas conversion happens here: https://github.com/matthewwardrop/formulaic/blob/main/formulaic/materializers/arrow.py . In practice, under the hood, the data can sometimes pass through this conversion uncopied, but the compute is then done on numpy arrays or pandas Series, depending on the transform. Again, the framework is datatype agnostic, so it is happy with other types, but we'd need to go through and update the transforms (like contrast encodings) to make sure they have implementations for these types.
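To illustrate what "implementations for these types" could mean, here is a rough sketch of a transform that dispatches on its input type. It is purely hypothetical and not how formulaic is written: the `center` function is invented for this example, and plain lists and `array.array` stand in for two different dataframe backends so that the snippet only needs the standard library.

```python
from array import array
from functools import singledispatch


@singledispatch
def center(values):
    # Fallback: a backend without a registered implementation fails loudly.
    raise NotImplementedError(
        f"no 'center' implementation for {type(values).__name__}"
    )


@center.register
def _(values: list):
    # Stand-in for one backend (e.g. pandas Series).
    mean = sum(values) / len(values)
    return [v - mean for v in values]


@center.register
def _(values: array):
    # Stand-in for a second backend; the result keeps the input's container type.
    mean = sum(values) / len(values)
    return array("d", (v - mean for v in values))


print(center([1.0, 2.0, 3.0]))
print(center(array("d", [1.0, 3.0])))
```

Supporting a new dataframe library under this kind of scheme means registering one more implementation per transform, which is where the bulk of the work would sit.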
Maybe one thing to consider here is the effort to come up with a DataFrame API: https://data-apis.org/dataframe-api/draft/
It could be handy for writing DataFrame-agnostic code.
Hi @matthewwardrop - would you be open to using Narwhals for this? Altair recently adopted it for this purpose https://github.com/vega/altair/pull/3452, as did scikit-lego
Happy to put up a POC if you'd be interested (just checking first!)
Polars is a (relatively) new dataframe library that is gaining more popularity and blows pandas away in performance using arrow memory in the backend.