open2c / bioframe

Genomic interval operations on Pandas DataFrames
MIT License

Alternative DataFrame class(es) for OOC + speed #137

Open ivirshup opened 1 year ago

ivirshup commented 1 year ago

Hey all,

I was wondering if you had considered supporting alternative dataframe classes in this library? In particular, I was thinking about the lazy/accelerated ones built on Arrow (e.g. polars, datafusion).

I would hope that the current API could be amenable to this by using `@singledispatch` to dispatch functions to different backends. It could also be nice to take advantage of a backend that is able to work with out-of-core amounts of data and do optimizations based on column order.
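A minimal sketch of what that kind of dispatch could look like, using only the standard library. Everything here is hypothetical: `RowTable` and `ColumnTable` are toy stand-ins for two dataframe backends (say, a pandas-like and a polars-like one), and `select_overlapping` is an illustrative name, not a bioframe function.

```python
from dataclasses import dataclass
from functools import singledispatch


# Hypothetical stand-ins for two dataframe backends; not real library classes.
@dataclass
class RowTable:
    rows: list  # list of (chrom, start, end) tuples


@dataclass
class ColumnTable:
    chrom: list
    start: list
    end: list


@singledispatch
def select_overlapping(df, chrom, start, end):
    """Return intervals in `df` overlapping the half-open region chrom:[start, end)."""
    raise NotImplementedError(f"no backend registered for {type(df).__name__}")


@select_overlapping.register
def _(df: RowTable, chrom, start, end):
    # Row-oriented backend: filter tuples one at a time.
    kept = [r for r in df.rows if r[0] == chrom and r[1] < end and r[2] > start]
    return RowTable(kept)


@select_overlapping.register
def _(df: ColumnTable, chrom, start, end):
    # Column-oriented backend: build a boolean mask, then apply it per column.
    mask = [
        c == chrom and s < end and e > start
        for c, s, e in zip(df.chrom, df.start, df.end)
    ]

    def take(column):
        return [value for value, keep in zip(column, mask) if keep]

    return ColumnTable(take(df.chrom), take(df.start), take(df.end))


rows = RowTable([("chr1", 0, 10), ("chr1", 20, 30), ("chr2", 5, 15)])
cols = ColumnTable(["chr1", "chr1", "chr2"], [0, 20, 5], [10, 30, 15])

# The same call works on either backend and returns the same backend type.
hits_rows = select_overlapping(rows, "chr1", 5, 25)
hits_cols = select_overlapping(cols, "chr1", 25, 40)
```

The caller only ever sees one public function, and each backend keeps its own implementation, so a new backend could be added by registering one more function without touching the existing ones.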

I've also been having a good time interacting with annotation resources via ibis which could integrate nicely with this kind of approach.

Phlya commented 1 year ago

BTW pandas 2.0 will have a pyarrow backend... I wonder how that will work for bioframe.

ivirshup commented 1 year ago

> BTW pandas 2.0 will have a pyarrow backend

Yup, I've already opened issues around the release candidate 😅. I'm not actually sure how much the current pyarrow backend is changing, or if it's just no longer experimental.

But, while pyarrow will probably have better performance than pandas (especially with strings), I think backends like duckdb or polars have the much larger benefit of being able to work with out-of-core data efficiently.

endrebak commented 1 year ago

I am collaborating with the bioframe authors on this project (not in a usable state yet): https://github.com/endrebak/poranges

ivirshup commented 1 year ago

Related to this, there is a request for input on defining a dataframe standard: https://data-apis.org/blog/dataframe_standard_rfc/