pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org

ENH: support the Arrow PyCapsule Interface for importing data #59631

Open jorisvandenbossche opened 2 months ago

jorisvandenbossche commented 2 months ago

We have https://github.com/pandas-dev/pandas/pull/56587 and https://github.com/pandas-dev/pandas/issues/59518 now for exporting pandas DataFrame and Series through the Arrow PyCapsule Interface (i.e. adding __arrow_c_stream__ methods), but we don't yet have the import counterpart.
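
For context, the export half already works; a quick demonstration of the protocol in action (assuming pandas >= 2.2 for the dunder and pyarrow >= 14 for RecordBatchReader.from_stream):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]})

# pyarrow consumes any object exposing __arrow_c_stream__,
# without special-casing pandas
reader = pa.RecordBatchReader.from_stream(df)
tbl = reader.read_all()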

For importing, the specification doesn't provide any API guidelines on what this should look like, so we have a couple of options. The two main ones I can think of:

- a dedicated class method, e.g. DataFrame.from_arrow(..)
- accepting such objects directly in the main DataFrame(..) constructor

In pandas itself, we do have a couple of from_.. class methods (from_dict/from_records), but the objects they accept are often also allowed in the main constructor (at least for the dict case). I think the main differentiator is that the specific class methods have more specialized keyword arguments (and therefore allow a larger variety of input). Following that pattern, we could also do both: add a DataFrame.from_arrow() class method, and also accept such objects in pd.DataFrame(), passing through to from_arrow() (which could have more custom options to control exactly how the conversion from Arrow to pandas is done).
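
To make that concrete, a minimal sketch of how the two could combine (from_arrow and the constructor pass-through are the proposals here, not existing pandas API):

import pandas as pd
import pyarrow as pa

tbl = pa.table({"a": [1, 2, 3]})  # any object exposing __arrow_c_stream__

df = pd.DataFrame.from_arrow(tbl)  # hypothetical dedicated class method
df = pd.DataFrame(tbl)             # hypothetical pass-through to from_arrow()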

Looking at polars, it seems they also have both, but I am not entirely sure about the connection between the two. pl.from_arrow already existed but might be more specific to pyarrow? And then https://github.com/pola-rs/polars/pull/17693 added it to the main pl.DataFrame(..) constructor (@kylebarron).

For geopandas, I added a GeoDataFrame.from_arrow() method.

(to be clear, everything said above also applies to Series() / Series.from_arrow() etc)

cc @MarcoGorelli @WillAyd

mroeschke commented 2 months ago

xref https://github.com/pandas-dev/pandas/issues/54057 where a user expected pandas.DataFrame to return ArrowDtype columns after passing a pyarrow object.

I would be +1 for a from_arrow constructor for objects with an Arrow PyCapsule Interface

WillAyd commented 2 months ago

Do you know why the PyCapsule interface chose not to specify anything around imports? I vaguely recall some upstream conversations about that, but I'm not sure where they landed.

My concern about the Python API is overloading the specification with a bunch of pandas-specific functionality. Maybe that is by design, but having something like Series.from_arrow(capsule, dtype_backend="numpy") seems a bit strange

kylebarron commented 2 months ago

> Looking at polars, it seems they also have both, but I am not entirely sure about the connection between the two.

polars has a module-level polars.from_arrow.

The main problem with a module-level from_arrow is that there's no reliable way to know whether PyCapsule input that emits struct arrays is intended to be a Series or a DataFrame. In the Arrow C data interface, a struct array is overloaded for both uses, so we really need the user's intent to say which target class should be constructed.
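
To illustrate the ambiguity with pyarrow:

import pyarrow as pa

# the same struct layout can mean two different things
struct_arr = pa.array([{"x": 1, "y": 2}, {"x": 3, "y": 4}])

# 1) one column whose values are structs -> naturally a Series
# 2) a two-column batch of "x" and "y"   -> naturally a DataFrame
batch = pa.RecordBatch.from_struct_array(struct_arr)

# both export identically through the C stream interface, so an importer
# can't tell from the capsule alone which target class was intended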

My PR didn't touch the module-level from_arrow. That constructor still only supports known inputs.

jorisvandenbossche commented 2 months ago

> Do you know why the PyCapsule interface chose not to specify anything around imports?

It says "this is left up to individual libraries". The dunder method is not user-visible API, so it is fine to impose requirements there. But for a public import function or method, a library might want to make certain choices to stay consistent within their own library.

For example, polars now uses pl.DataFrame(..), which would work for pandas as well, but the spec can't really require a library.DataFrame(..) usage (not every library uses that name, or uses class constructors, etc.).

(Now, while we are talking about a public import method, it is certainly a valid question whether there should also be a protocol for import, so that you could round-trip, but that's a different topic I think.)

> My concern about the Python API is overloading the specification with a bunch of pandas-specific functionality. Maybe that is by design, but having something like Series.from_arrow(capsule, dtype_backend="numpy") seems a bit strange

Why does that seem strange? We have such a keyword in other functions, so why not here? I would say that the point of a dedicated from_arrow(..) method is that it makes it easier to add custom keywords when required.

> xref #54057 where a user expected pandas.DataFrame to return ArrowDtype columns after passing a pyarrow object.

Interesting reference. Personally, I think that a method to consume Arrow data should by default return default data types (and not ArrowDtype). We can give users control over that, though (like with the dtype_backend keyword in other IO methods).
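
For comparison, the existing pattern on the IO side (real pandas API; the last line is the hypothetical counterpart for import):

import pandas as pd

df_default = pd.read_parquet("data.parquet")                         # default NumPy-backed dtypes
df_arrow = pd.read_parquet("data.parquet", dtype_backend="pyarrow")  # ArrowDtype columns

# pd.DataFrame.from_arrow(obj, dtype_backend="pyarrow")  # hypothetical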

WillAyd commented 2 months ago

> Why does that seem strange? We have such a keyword in other functions, so why not here? I would say that the point of a dedicated from_arrow(..) method is that it makes it easier to add custom keywords when required.

I think because this blurs the line between the PyCapsule interface as an exchange mechanism and that same interface as an end-user API. I'm of the impression that our target audience is other developers and their libraries, not necessarily an end user using this like it's an I/O method.

WillAyd commented 2 months ago

To give a real use case, I've had a need for this in a library I created called pantab:

https://github.com/innobi/pantab/blob/ce3dc034102a506c2348de71169859c84c3be231/src/pantab/_reader.py#L13

At least from the perspective of that library, I ideally would want the dataframe libraries to all have one consistent interface. That way, my third party library could just say "ok, whatever dataframe library you are using, I'm just going to send this capsule through to X and you will get back the result you want"

If each library overloads their import mechanisms and offers different features, then third party producers of Arrow data aren't any better off than they are today

kylebarron commented 2 months ago

The PyCapsule Interface is focused on use cases around importing some foreign data into your library. I think the right way forward is not to specify a particular import API, but rather to advocate for more libraries to look for and understand PyCapsule objects.

In your case where you have return_type, I'd argue that's an anti-pattern here. Instead, as long as you return an object of any class that also implements the PyCapsule Interface, users are able to pass that return object into whatever library they want.

import polars as pl
from arro3.compute import take
import pyarrow as pa
import pyarrow.parquet as pq

# Creates a polars object (sample data for illustration)
df = pl.DataFrame({"a": [10, 20, 30, 40, 50, 60]})

# understands the polars object via C Stream
# returns an arro3 RecordBatchReader
filtered = take(df, [1, 4, 2, 5])

# understands the arro3 object via C Stream
pq.write_table(pa.table(filtered), "filtered.parquet")

> If each library overloads their import mechanisms and offers different features, then third party producers of Arrow data aren't any better off than they are today

In particular, my argument is that an Arrow producer should not choose the user-facing API, but rather just expose the data protocol. Then the user can choose how to import the data as they wish.
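
Concretely, a sketch (using pyarrow as the producer; pl.DataFrame accepting stream-protocol objects is the polars change referenced above):

import polars as pl
import pyarrow as pa

# producer side: just return something exposing __arrow_c_stream__
stream = pa.table({"a": [1, 2, 3]})

# consumer side: each user picks their own library's import path
pl_df = pl.DataFrame(stream)                       # polars
reader = pa.RecordBatchReader.from_stream(stream)  # pyarrow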

WillAyd commented 2 months ago

> In your case where you have return_type, I'd argue that's an anti-pattern here

Absolutely. To be clear, that code is from 7 months ago, before any library (except pyarrow) started supporting imports. I am definitely trying to move away from that pattern, not promote its usage.

> I think the right way forward is not to specify a particular import API, but rather to advocate for more libraries to look for and understand PyCapsule objects.

Is the Python capsule available at runtime? I thought it was just for extension authors and not really even inspectable (i.e. can you even do an isinstance check for one?), but maybe that knowledge is outdated.

I really like the code that you have there @kylebarron, but the arro3.RecordBatchReader is the piece that I think we are missing in pandas. Maybe we need something like that instead of just passing around raw capsules?

kylebarron commented 2 months ago

> Is the Python capsule available at runtime?

Sorry, by "pycapsule objects" I meant to say "instances of classes that have Arrow PyCapsule Interface dunder methods and can export PyCapsules".
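
In practice the check is duck typing on the dunder rather than an isinstance test; a minimal sketch:

def supports_arrow_stream(obj) -> bool:
    # the capsule itself is opaque, but the protocol method that
    # produces it is an ordinary attribute we can look for
    return hasattr(obj, "__arrow_c_stream__")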

> I really like the code that you have there @kylebarron, but the arro3.RecordBatchReader is the piece that I think we are missing in pandas. Maybe we need something like that instead of just passing around raw capsules?

Well, that's why I created arro3 😄. I wanted a lightweight (~7MB compared to pyarrow's >100MB) library that can manage Arrow data in a compliant way between libraries, but with nicer high-level APIs than nanoarrow. It has wheels for every platform, including pyodide.

WillAyd commented 2 months ago

Well, I don't want to try to boil the ocean here, but I wonder whether, if we don't require pyarrow, we should look at requiring arro3 as a fallback. I think there's good value in having another library provide a consistent object like a RecordBatchReader for data exchange like this, and we could just accept that in our Series/DataFrame constructors, rather than building that ourselves.

kylebarron commented 2 months ago

Well, I'd say the point of arro3 is to handle cases like this. But at the same time, being stable enough to be a required pandas dependency is a pretty high bar...

I'd say that in managing Arrow data, arro3 is relatively stable, but that in managing interop with pandas and numpy it's less stable.