ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

feat: Implement Arrow PyCapsule Interface #9140

Closed (kylebarron closed 1 month ago)

kylebarron commented 1 month ago

Is your feature request related to a problem?

Currently Ibis integrates with Arrow via the to_pyarrow method. The downside of this is that library consumers have to:

  1. Be aware of ibis
  2. Look for specific ibis data types
  3. Know that they can call this to_pyarrow method, which tends to be named differently in different libraries. E.g. DuckDB calls it .arrow() and Polars calls it .to_arrow().

What is the motivation behind your request?

The Arrow PyCapsule Interface is a new standard for exchanging Arrow data in Python. Among other benefits, it defines a single method name (__arrow_c_stream__) that is public and standardized.

This means that other libraries don't have to build specific connectors to Polars, DuckDB, Ibis, pyarrow, etc., but can instead support any input object with an __arrow_c_stream__ method. For my particular use case, Lonboard, a geospatial visualization library I develop, is Arrow-based and looks for this method.
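To illustrate the consumer side described above, here is a minimal sketch of how a library might accept any input that exposes __arrow_c_stream__ without knowing which library produced it. The names SupportsArrowStream, export_stream, and StubProducer are hypothetical, and the stub returns a placeholder string rather than a real Arrow capsule, so this only demonstrates the duck-typing check, not actual data exchange.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsArrowStream(Protocol):
    """Hypothetical protocol: any object exposing the standardized method."""

    def __arrow_c_stream__(self, requested_schema: Any = None) -> Any: ...


def export_stream(obj: Any, requested_schema: Any = None) -> Any:
    """Call __arrow_c_stream__ on obj, regardless of which library made it."""
    # isinstance on a runtime-checkable Protocol only checks that the
    # method exists; it does not validate the signature or return value.
    if not isinstance(obj, SupportsArrowStream):
        raise TypeError(
            f"{type(obj).__name__} does not implement __arrow_c_stream__"
        )
    return obj.__arrow_c_stream__(requested_schema)


class StubProducer:
    """Hypothetical stand-in for a Polars/DuckDB/Ibis/pyarrow object."""

    def __arrow_c_stream__(self, requested_schema=None):
        # A real producer would return a PyCapsule here; this is a placeholder.
        return "stream-capsule-placeholder"
```

With this shape, the consumer needs no per-library branches: an object is accepted if it has the method and rejected with a TypeError otherwise.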

Describe the solution you'd like

Implement an __arrow_c_stream__ method wherever there's currently a to_pyarrow method. This could be as simple as

    def __arrow_c_stream__(self, requested_schema=None):
        # Materialize to a pyarrow Table and let it export the stream.
        return self.to_pyarrow().__arrow_c_stream__(requested_schema)

where it uses the fact that the pyarrow Table class implements this method as of v14 or so (not 100% sure which version it was added in).
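For context, here is a sketch of that delegation placed inside a class. Both classes are hypothetical stand-ins: StubArrowTable plays the role of the pyarrow Table and TableExpr plays the role of an object that has a to_pyarrow method. No real Arrow data is involved; the stub only records the argument it receives, to show that requested_schema is forwarded unchanged.

```python
class StubArrowTable:
    """Hypothetical stand-in for a pyarrow Table; not a real Arrow object."""

    def __init__(self):
        self.seen_schema = "unset"

    def __arrow_c_stream__(self, requested_schema=None):
        # A real table would return a PyCapsule named "arrow_array_stream";
        # this stub records the argument and returns a placeholder instead.
        self.seen_schema = requested_schema
        return "stream-capsule-placeholder"


class TableExpr:
    """Hypothetical stand-in for an object that has a to_pyarrow method."""

    def __init__(self, materialized):
        self._materialized = materialized

    def to_pyarrow(self):
        # Assumption: materializing yields something Arrow-exportable.
        return self._materialized

    def __arrow_c_stream__(self, requested_schema=None):
        # Delegate the export to the materialized table.
        return self.to_pyarrow().__arrow_c_stream__(requested_schema)
```

One consequence of this approach is that the whole result is materialized through to_pyarrow before any stream is handed out, which keeps the implementation small at the cost of not streaming batch by batch.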

What version of ibis are you running?

I haven't run ibis yet but __arrow_c_stream__ is not found in a code search of the repo.

What backend(s) are you using, if any?

No response


jcrist commented 1 month ago

Makes sense to me! Would we also want to implement __arrow_c_schema__ (as per the docs you linked)?

kylebarron commented 1 month ago

There's some ongoing discussion about this at https://github.com/apache/arrow/issues/39689, but my own understanding is that if you're exporting a "table" or "stream of batches", then you'd only implement the __arrow_c_stream__ method. If instead you're defining your own Arrow-compatible data types (which maybe ibis is), then you could implement __arrow_c_schema__ on that data type or field object.

cpcloud commented 1 month ago

We have a single extension type used when exporting complex-type data from the Snowflake backend to pyarrow. It's really only for internal use at the moment. It's unclear how it would interact with the effort here, but I don't think it should be a blocker, as it's the only backend that has this issue and I'm pretty sure we can solve the problem that type is solving in some other way if we need to.

jcrist commented 1 month ago

Sounds good to me - I pushed up a quick PR to support this for ibis.Table types in #9143.