pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.16k stars 1.83k forks source link

Arrow PyCapsule Interface support #12530

Open wjones127 opened 9 months ago

wjones127 commented 9 months ago

Description

In the Arrow project, we recently created a new protocol for sharing Arrow data in Python. One of the goals of the protocol is allow exporting / importing Arrow data in Python without having to necessarily use PyArrow as an intermediary. For example, DuckDB can read from Polars DataFrames and LazyFrames, but only if PyArrow is installed. One this protocol is implemented, it would be possible to accomplish that integration without PyArrow.

This allows Arrow-exportable objects to be recognized based on the presence of one of several dunder methods.

Polars could implement this in two ways:

I'd be happy to contribute this to the repo, if these ideas sound good.

wjones127 commented 9 months ago

Looking through the codebase, it seems there is some basic work that needs to be done to make the Arrow interoperability more generic. Right now the import implementation seems to rely on PyArrow-specific APIs:

https://github.com/pola-rs/polars/blob/e461ffca6953f0f47a0b5e063a5d37d62a8c8e2a/py-polars/polars/utils/_construction.py#L1472-L1555

ritchie46 commented 8 months ago

Sorry for the delay. Somehow I missed this. I think this sounds great. Being agnostic to arrow consumer without hard pyarrow dependency sound good.

Does your offer still stand on this?

wjones127 commented 8 months ago

Yes, I’ve started work on this locally but got distracted. I’ll try to get back to it soon :)

eitsupi commented 7 months ago

Related to #14208

paleolimbot commented 7 months ago

I'm still working on the Python part, but ChunkedArray import/export to ArrowArrayStream in C++ just merged, which should make this more useful when applied to a Series: https://github.com/apache/arrow/pull/39455 .

eitsupi commented 4 months ago

FYI, I tried to implement ArrayStream import functionality in r-polars, but found a considerable speed reduction compared to the previous implementation (copied from py-polars), so I reverted (https://github.com/pola-rs/r-polars/pull/1078#issuecomment-2098507677).

deanm0000 commented 3 months ago

I wonder if using the __arrow_c_stream__ method would obviate this https://github.com/pola-rs/polars/issues/16614

deanm0000 commented 2 months ago

@wjones127 curious if this is still something you're working on?

wjones127 commented 2 months ago

curious if this is still something you're working on?

I haven't had time to finish this, no. I may return to this later this year, if someone else hasn't gotten to it.

kylebarron commented 1 month ago

I started a PR for data export in https://github.com/pola-rs/polars/pull/17676

kylebarron commented 1 month ago

And a PR for DataFrame import via the C Stream in #17693

kylebarron commented 1 month ago

This is mostly resolved by #17676, #17693, and https://github.com/pola-rs/polars/pull/17935. Potential follow ups include: