Open wjones127 opened 11 months ago
Looking through the codebase, it seems there is some basic work that needs to be done to make the Arrow interoperability more generic. Right now the import implementation seems to rely on PyArrow-specific APIs:
Sorry for the delay. Somehow I missed this. I think this sounds great. Being agnostic to the Arrow consumer, without a hard PyArrow dependency, sounds good.
Does your offer still stand on this?
Yes, I’ve started work on this locally but got distracted. I’ll try to get back to it soon :)
Related to #14208
I'm still working on the Python part, but ChunkedArray import/export to ArrowArrayStream in C++ just merged, which should make this more useful when applied to a Series: https://github.com/apache/arrow/pull/39455 .
FYI, I tried to implement ArrayStream import functionality in r-polars, but found a considerable speed reduction compared to the previous implementation (copied from py-polars), so I reverted (https://github.com/pola-rs/r-polars/pull/1078#issuecomment-2098507677).
I wonder if using the __arrow_c_stream__ method would obviate this https://github.com/pola-rs/polars/issues/16614
@wjones127 curious if this is still something you're working on?
I haven't had time to finish this, no. I may return to this later this year, if someone else hasn't gotten to it.
I started a PR for data export in https://github.com/pola-rs/polars/pull/17676
And a PR for DataFrame import via the C Stream in #17693
This is mostly resolved by #17676, #17693, and https://github.com/pola-rs/polars/pull/17935. Potential follow ups include:
requested_schema
As mentioned in the Narwhals PR, and in the original post
Support Arrow PyCapsules in polars.from_arrow
I think this is still missing in polars.from_arrow, right? I could put up a PR for that later this week (unless anyone has time first, in which case, feel free to take it!)
Supporting the PyCapsule Interface via a top-level from_arrow isn't strictly possible because you don't know how to handle struct-typed arrays.
A struct Series with two float fields, x and y, is transported via the Arrow C Data/Stream interface exactly the same as a DataFrame/Table with two float columns, x and y. So supporting this in a general from_arrow function isn't strictly possible because you don't know whether the user wants to materialize this data as a Series or DataFrame. That's why I only implemented support for this in DataFrame.__init__ and Series.__init__.
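To make the ambiguity concrete, here is a toy model in plain Python. It is only a sketch: Field, StreamSchema, and the export_*/import_* helpers are hypothetical names invented for illustration, not Polars or Arrow APIs, and the "schema" is a simplified stand-in for what travels over the C Stream interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str


@dataclass(frozen=True)
class StreamSchema:
    """Toy stand-in for the top-level (struct-typed) schema of an Arrow stream."""
    fields: tuple


def export_struct_series(field_names):
    # A struct Series exports struct-typed arrays: one child per struct field.
    return StreamSchema(tuple(Field(n, "float64") for n in field_names))


def export_dataframe(column_names):
    # A DataFrame exports record batches, also struct-typed: one child per column.
    return StreamSchema(tuple(Field(n, "float64") for n in column_names))


def import_as_dataframe(schema):
    # Explicit choice: treat each child as a column.
    return [f.name for f in schema.fields]


def import_as_series(schema):
    # Explicit choice: treat the whole schema as a single struct dtype.
    return "struct[" + ", ".join(f"{f.name}: {f.dtype}" for f in schema.fields) + "]"


series_stream = export_struct_series(["x", "y"])
frame_stream = export_dataframe(["x", "y"])

# The consumer sees identical schemas, so it cannot tell which object produced them.
print(series_stream == frame_stream)
print(import_as_dataframe(frame_stream))
print(import_as_series(series_stream))
```

Because the two exports compare equal, only the caller's choice of constructor can decide the result, which is the reasoning behind putting the import in DataFrame.__init__ and Series.__init__ rather than in a single generic function.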
Description
In the Arrow project, we recently created a new protocol for sharing Arrow data in Python. One of the goals of the protocol is to allow exporting / importing Arrow data in Python without necessarily having to use PyArrow as an intermediary. For example, DuckDB can read from Polars DataFrames and LazyFrames, but only if PyArrow is installed. Once this protocol is implemented, it would be possible to accomplish that integration without PyArrow.
This allows Arrow-exportable objects to be recognized based on the presence of one of several dunder methods.
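As a rough sketch of what "recognized based on the presence of dunder methods" could look like on the consumer side, the snippet below checks for the three method names defined by the Arrow PyCapsule Interface using only the standard library. The helper arrow_export_kinds and the FakeTable class are hypothetical and exist only for this illustration; a real exporter would return a PyCapsule rather than raise.

```python
# The three dunder methods defined by the Arrow PyCapsule Interface.
ARROW_DUNDERS = {
    "__arrow_c_schema__": "schema",
    "__arrow_c_array__": "array",
    "__arrow_c_stream__": "stream",
}


def arrow_export_kinds(obj):
    """Return the kinds of Arrow export that obj advertises (no PyArrow import needed)."""
    return [
        kind
        for name, kind in ARROW_DUNDERS.items()
        if callable(getattr(obj, name, None))
    ]


class FakeTable:
    """Hypothetical stand-in for a stream exporter; it does not produce a real capsule."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("a real exporter returns a PyCapsule here")


print(arrow_export_kinds(FakeTable()))  # ['stream']
print(arrow_export_kinds(object()))     # []
```

The check is plain duck typing: the consumer never needs to know the exporting library's types, only whether the object exposes one of the protocol methods.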
Polars could implement this in two ways:
Export: DataFrame, Series, DataType
Import: polars.from_arrow
polars.DataFrame constructor [...] pd.DataFrame, so it would make logical sense to support reading rectangular-shaped Arrow data.
I'd be happy to contribute this to the repo, if these ideas sound good.