Open wjones127 opened 9 months ago
Looking through the codebase, it seems there is some basic work that needs to be done to make the Arrow interoperability more generic. Right now the import implementation seems to rely on PyArrow-specific APIs:
Sorry for the delay. Somehow I missed this. I think this sounds great. Being agnostic to the Arrow consumer without a hard PyArrow dependency sounds good.
Does your offer still stand on this?
Yes, I’ve started work on this locally but got distracted. I’ll try to get back to it soon :)
Related to #14208
I'm still working on the Python part, but ChunkedArray import/export to ArrowArrayStream in C++ just merged, which should make this more useful when applied to a Series: https://github.com/apache/arrow/pull/39455 .
FYI, I tried to implement ArrayStream import functionality in r-polars, but found a considerable speed reduction compared to the previous implementation (copied from py-polars), so I reverted (https://github.com/pola-rs/r-polars/pull/1078#issuecomment-2098507677).
I wonder if using the __arrow_c_stream__ method would obviate this https://github.com/pola-rs/polars/issues/16614
@wjones127 curious if this is still something you're working on?
I haven't had time to finish this, no. I may return to this later this year, if someone else hasn't gotten to it.
I started a PR for data export in https://github.com/pola-rs/polars/pull/17676
And a PR for DataFrame import via the C Stream in #17693
This is mostly resolved by #17676, #17693, and https://github.com/pola-rs/polars/pull/17935. Potential follow ups include:
requested_schema
Description
In the Arrow project, we recently created a new protocol for sharing Arrow data in Python. One of the goals of the protocol is to allow exporting and importing Arrow data in Python without necessarily using PyArrow as an intermediary. For example, DuckDB can read from Polars DataFrames and LazyFrames, but only if PyArrow is installed. Once this protocol is implemented, that integration would be possible without PyArrow.
This allows Arrow-exportable objects to be recognized based on the presence of one of several dunder methods.
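To make the dunder-based recognition concrete, here is a minimal sketch of how a consumer might detect an Arrow-exportable object. The three method names come from the Arrow PyCapsule Interface; the helper `arrow_export_kinds` and the `FakeTable` class are hypothetical names used only for illustration and are not part of Polars or PyArrow.

```python
# Dunder names defined by the Arrow PyCapsule Interface, mapped to a short label.
ARROW_DUNDERS = {
    "__arrow_c_schema__": "schema",
    "__arrow_c_array__": "array",
    "__arrow_c_stream__": "stream",
}


def arrow_export_kinds(obj):
    """Return the kinds of Arrow data `obj` advertises it can export.

    Hypothetical helper: it only checks for the presence of the dunder
    methods and never calls them.
    """
    return [
        kind
        for name, kind in ARROW_DUNDERS.items()
        if callable(getattr(obj, name, None))
    ]


class FakeTable:
    """Hypothetical stand-in for a DataFrame-like producer."""

    def __arrow_c_stream__(self, requested_schema=None):
        # A real producer would return a PyCapsule wrapping an ArrowArrayStream.
        raise NotImplementedError("illustration only")


print(arrow_export_kinds(FakeTable()))  # ['stream']
print(arrow_export_kinds(object()))     # []
```

Because the check is plain attribute lookup, the consumer needs no import of the producer's library, which is what removes the hard PyArrow dependency.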
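Polars could implement this in two ways:
1. Export: implement the dunder methods on DataFrame, Series, and DataType.
2. Import: recognize objects that expose these methods in polars.from_arrow and possibly in the polars.DataFrame constructor. The constructor already accepts pd.DataFrame, so it would make logical sense to support reading rectangular-shaped Arrow data.

I'd be happy to contribute this to the repo, if these ideas sound good.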
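As a rough sketch of what the import side could look like, the function below dispatches on whichever protocol method an object exposes. It assumes the PyCapsule Interface conventions that `__arrow_c_stream__` returns a single capsule and `__arrow_c_array__` returns a (schema, array) pair of capsules. The function `import_arrow` and its `from_stream` / `from_array` callbacks are hypothetical placeholders for whatever the library does with the capsules; this is not how Polars implements it.

```python
def import_arrow(obj, *, from_stream, from_array):
    """Hypothetical import entry point for Arrow-exportable objects.

    Prefers the stream export (a natural fit for chunked, table-shaped data)
    and falls back to the single-array export.
    """
    if hasattr(obj, "__arrow_c_stream__"):
        # One capsule wrapping an ArrowArrayStream.
        return from_stream(obj.__arrow_c_stream__())
    if hasattr(obj, "__arrow_c_array__"):
        # A pair of capsules: (ArrowSchema, ArrowArray).
        schema_capsule, array_capsule = obj.__arrow_c_array__()
        return from_array(schema_capsule, array_capsule)
    raise TypeError(
        f"{type(obj).__name__} does not implement the Arrow PyCapsule Interface"
    )


# Illustration with fake producers that return strings instead of real capsules.
class FakeStreamProducer:
    def __arrow_c_stream__(self, requested_schema=None):
        return "stream-capsule"


class FakeArrayProducer:
    def __arrow_c_array__(self, requested_schema=None):
        return ("schema-capsule", "array-capsule")


handlers = {
    "from_stream": lambda capsule: ("stream", capsule),
    "from_array": lambda schema, array: ("array", schema, array),
}
print(import_arrow(FakeStreamProducer(), **handlers))
print(import_arrow(FakeArrayProducer(), **handlers))
```

Checking for the stream method first is a design choice made for this sketch; a real implementation might order the checks differently or pass a `requested_schema` through to the producer.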