pepkit / peppy

Project metadata manager for PEPs in Python
https://pep.databio.org/peppy
BSD 2-Clause "Simplified" License

Replace `pandas` with `polars` #486

Open nleroy917 opened 1 month ago

I want to bring up the idea of replacing pandas with polars. I can think of three reasons why this would be beneficial:

Processing speed

polars is much faster. @khoroshevskyi has been investigating this, and adopting polars could drastically reduce the time it takes to process PEPs on the PEPhub server, enabling real-time edits to PEPs.

It's hard to find unbiased, fair comparisons, especially given the polars hype, but this post does a pretty good job of highlighting some of the large improvements.

Import speed

From my own experimentation, importing polars is almost 4 times faster than importing pandas. This would help with things like the looper CLI import issues: https://github.com/pepkit/looper/issues/476
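
For anyone who wants to check the import-time claim on their own machine, here is a minimal sketch of one way to measure it. It is not the method I used for the number above; `time_import` is a hypothetical helper written for this comment, and it assumes each import should be timed in a fresh interpreter so that already-cached modules don't skew the result.

```python
import subprocess
import sys


def time_import(module: str, repeats: int = 3) -> float:
    """Return the best wall-clock time (seconds) to import `module`.

    Hypothetical helper: each measurement runs in a fresh interpreter
    so modules already loaded in this process don't affect the timing.
    """
    code = (
        "import time; t = time.perf_counter(); "
        f"import {module}; "
        "print(time.perf_counter() - t)"
    )
    timings = []
    for _ in range(repeats):
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            check=True,  # raises CalledProcessError if the import fails
        )
        # Use the last stdout line in case the module prints on import.
        timings.append(float(result.stdout.strip().splitlines()[-1]))
    return min(timings)


if __name__ == "__main__":
    for name in ("pandas", "polars"):
        try:
            print(f"{name}: {time_import(name):.3f} s")
        except subprocess.CalledProcessError:
            print(f"{name}: could not be imported in this environment")
```

Results will vary with the machine, the installed versions, and whether the files are already in the OS file cache, so treat any ratio as a rough indication. `python -X importtime -c "import pandas"` gives a per-module breakdown if more detail is needed.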

Interface with genimtools

Genimtools is native Rust with pyo3 bindings, and polars follows the same model. Because of this, integrating peppy objects with genimtools should be seamless. In fact, there is an entire crate maintained by the polars group dedicated to this interface.

This sets the stage for processing PEPs and their data in genimtools, further improving server speeds for real-time PEP editing. eido comes to mind as a potential bottleneck there.

Potential downsides

I think some downsides to such a switch are: