pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.2k stars 1.95k forks source link

Arrow PyCapsule Interface support #12530

Open wjones127 opened 11 months ago

wjones127 commented 11 months ago

Description

In the Arrow project, we recently created a new protocol for sharing Arrow data in Python. One of the goals of the protocol is allow exporting / importing Arrow data in Python without having to necessarily use PyArrow as an intermediary. For example, DuckDB can read from Polars DataFrames and LazyFrames, but only if PyArrow is installed. One this protocol is implemented, it would be possible to accomplish that integration without PyArrow.

This allows Arrow-exportable objects to be recognized based on the presence of one of several dunder methods.

Polars could implement this in two ways:

I'd be happy to contribute this to the repo, if these ideas sound good.

wjones127 commented 11 months ago

Looking through the codebase, it seems there is some basic work that needs to be done to make the Arrow interoperability more generic. Right now the import implementation seems to rely on PyArrow-specific APIs:

https://github.com/pola-rs/polars/blob/e461ffca6953f0f47a0b5e063a5d37d62a8c8e2a/py-polars/polars/utils/_construction.py#L1472-L1555

ritchie46 commented 10 months ago

Sorry for the delay. Somehow I missed this. I think this sounds great. Being agnostic to arrow consumer without hard pyarrow dependency sound good.

Does your offer still stand on this?

wjones127 commented 10 months ago

Yes, I’ve started work on this locally but got distracted. I’ll try to get back to it soon :)

eitsupi commented 9 months ago

Related to #14208

paleolimbot commented 9 months ago

I'm still working on the Python part, but ChunkedArray import/export to ArrowArrayStream in C++ just merged, which should make this more useful when applied to a Series: https://github.com/apache/arrow/pull/39455 .

eitsupi commented 6 months ago

FYI, I tried to implement ArrayStream import functionality in r-polars, but found a considerable speed reduction compared to the previous implementation (copied from py-polars), so I reverted (https://github.com/pola-rs/r-polars/pull/1078#issuecomment-2098507677).

deanm0000 commented 5 months ago

I wonder if using the __arrow_c_stream__ method would obviate this https://github.com/pola-rs/polars/issues/16614

deanm0000 commented 4 months ago

@wjones127 curious if this is still something you're working on?

wjones127 commented 4 months ago

curious if this is still something you're working on?

I haven't had time to finish this, no. I may return to this later this year, if someone else hasn't gotten to it.

kylebarron commented 3 months ago

I started a PR for data export in https://github.com/pola-rs/polars/pull/17676

kylebarron commented 3 months ago

And a PR for DataFrame import via the C Stream in #17693

kylebarron commented 3 months ago

This is mostly resolved by #17676, #17693, and https://github.com/pola-rs/polars/pull/17935. Potential follow ups include:

MarcoGorelli commented 3 weeks ago

As mentioned in the Narwhals PR, and in the original post

Support Arrow PyCapsules in polars.from_arrow

I think this is still missing in polars.from_arrow, right? I could put up a PR for that later this week (unless anyone has time first, in which case, feel free to take it!)

kylebarron commented 3 weeks ago

Supporting the PyCapsule Interface via a top-level from_arrow isn't strictly possible because you don't know how to handle struct-typed arrays.

A struct Series with two float fields, x and y, is transported via the Arrow C Data/Stream interface exactly the same as a DataFrame/Table with two float columns, x and y. So supporting this in a general from_arrow function isn't strictly possible because you don't know whether the user wants to materialize this data as a Series or DataFrame. That's why I only implemented support for this in DataFrame.__init__ and Series.__init__.