jorisvandenbossche opened 2 months ago
xref https://github.com/pandas-dev/pandas/issues/54057 where a user expected `pandas.DataFrame` to return an `ArrowDtype` after passing a pyarrow object.
I would be +1 for a `from_arrow` constructor for objects with an Arrow PyCapsule Interface.
Do you know why the pycapsule interface chose not to specify anything around imports? I vaguely recall some upstream conversations about that, but I'm not sure where it landed.
My concern about the Python API is overloading the specification with a bunch of pandas-specific functionality. Maybe that is by design, but having something like `Series.from_arrow(capsule, dtype_backend="numpy")` seems a bit strange.
Looking at polars, it seems they also have both, but I am not entirely sure about the connection between the two. polars has a module-level `polars.from_arrow`.
The main problem with a module-level `from_arrow` is that there is no reliable way to know whether PyCapsule input that emits struct arrays is intended to become a Series or a DataFrame. In the Arrow C data interface, a struct array is overloaded for both uses, so we really need the user's intent to decide which target class should be constructed.
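To make the ambiguity concrete, here is a minimal pyarrow sketch (illustrative only, not a proposed pandas API) showing the same struct data viewed both ways:

```python
import pyarrow as pa

# One struct array with two fields...
arr = pa.array([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])

# ...can be unpacked into a two-column batch (DataFrame-like)...
batch = pa.RecordBatch.from_struct_array(arr)

# ...or kept as a single struct-typed column (Series-like).
col = pa.chunked_array([arr])
```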
My PR didn't touch the module-level `from_arrow`. That constructor still only supports known inputs.
> Do you know why the pycapsule interface chose not to specify anything around imports?
It says "This is left up to individual libraries". The dunder method is not user-visible API, so it is fine to make requirements there. But for a public import function or method, a library might want to make certain choices to be consistent within their own library.
For example, polars now uses `pl.DataFrame`, which would work for pandas as well, but the spec can't really require a `library.DataFrame(..)` usage (not every library uses that name, or uses class constructors, etc.).
(Now, while we're speaking about a public import method, it might certainly be a valid question whether there should be a protocol for import as well, so that you could roundtrip; but that's a different topic, I think.)
> My concern about the Python API is overloading the specification with a bunch of pandas-specific functionality. Maybe that is by design, but having something like `Series.from_arrow(capsule, dtype_backend="numpy")` seems a bit strange.
Why does that seem strange? We have such a keyword in other functions, so why not here? I would say that the point of a dedicated `from_arrow(..)` method is that it makes it easier to add custom keywords when required.
> xref #54057 where a user expected `pandas.DataFrame` to return an `ArrowDtype` after passing a pyarrow object.
Interesting reference. Personally, I think that by default a method to consume arrow data should also return default data types (and not ArrowDtype). We can give users control over that, though (like with the dtype backend in other IO methods).
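As a rough sketch of what that could look like (hypothetical: `DataFrame.from_arrow` and its `dtype_backend` keyword are the API under discussion here, not something pandas ships):

```python
import pandas as pd
import pyarrow as pa

tbl = pa.table({"a": [1, 2, 3]})  # stands in for any __arrow_c_stream__ exporter

# Hypothetical default: NumPy-backed dtypes, consistent with other IO readers
df = pd.DataFrame.from_arrow(tbl)

# Hypothetical opt-in to pyarrow-backed dtypes, mirroring dtype_backend elsewhere
df_arrow = pd.DataFrame.from_arrow(tbl, dtype_backend="pyarrow")
```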
> Why does that seem strange? We have such a keyword in other functions, so why not here? I would say that the point of a dedicated `from_arrow(..)` method is that it makes it easier to add custom keywords when required.
I think because this blurs the line between the PyCapsule interface as an exchange mechanism and that same interface as an end-user API. I'm of the impression our target audience is other developers and their libraries, not necessarily an end user using this as if it were an I/O method.
To give a real use case, I've had a need for this in a library I created called pantab:
At least from the perspective of that library, I would ideally want the dataframe libraries to all have one consistent interface. That way, my third-party library could just say "ok, whatever dataframe library you are using, I'm just going to send this capsule through to X and you will get back the result you want".
If each library overloads their import mechanisms and offers different features, then third party producers of Arrow data aren't any better off than they are today
The PyCapsule Interface is focused on use cases around importing some foreign data to your library. I think the right way forward is not to specify a particular import API, but rather to advocate for more libraries to look for and understand pycapsule objects.

In your case where you have `return_type`, I'd argue that's an anti-pattern here. Instead, as long as you return any class that also implements the PyCapsule Interface, users are able to pass that return object into whatever library they want.
```python
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq
from arro3.compute import take

# Create a polars object
df = pl.DataFrame({"a": [10, 20, 30, 40, 50, 60]})

# take() understands the polars object via the C Stream interface
# and returns an arro3 RecordBatchReader
filtered = take(df, [1, 4, 2, 5])

# pyarrow understands the arro3 object via the C Stream interface
pq.write_table(pa.table(filtered), "filtered.parquet")
```
> If each library overloads their import mechanisms and offers different features, then third party producers of Arrow data aren't any better off than they are today
In particular, my argument is that an arrow producer should not choose the user-facing API but rather just expose the data protocol. Then the user can choose how to import the data as they wish.
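For instance, continuing the example above (a sketch assuming recent polars and pyarrow versions that implement the protocol; polars constructor support comes from pola-rs/polars#17693, mentioned later in this thread):

```python
import polars as pl
import pyarrow as pa

# Stand-in producer: any object exporting __arrow_c_stream__
producer = pa.table({"a": [1, 2, 3]})

# The user, not the producer, chooses the import target:
as_polars = pl.DataFrame(producer)  # via the polars constructor
back_to_pa = pa.table(as_polars)    # and back again via the same protocol
```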
> In your case where you have `return_type`, I'd argue that's an anti-pattern here
Absolutely. To be clear, that code was from 7 months ago, before any library (except for pyarrow) started supporting imports. I am definitely trying to solve that pattern, not promote its usage
> I think the right way forward is not to specify a particular import API, but rather to advocate for more libraries to look for and understand pycapsule objects.
Is the python capsule available at runtime? I thought it was just for extension authors and not really even inspectable (i.e. can you even do an `isinstance` check for one?), but maybe that knowledge is outdated.
I really like the code that you have there @kylebarron, but the arro3.RecordBatchReader is the piece that I think we are missing in pandas. Maybe we need something like that instead of just passing around raw capsules?
> Is the python capsule available at runtime?
Sorry, by "pycapsule objects" I meant to say "instances of classes that have Arrow PyCapsule Interface dunder methods and can export PyCapsules".
> I really like the code that you have there @kylebarron, but the arro3.RecordBatchReader is the piece that I think we are missing in pandas. Maybe we need something like that instead of just passing around raw capsules?
Well, that's why I created arro3 😄. I wanted a lightweight (~7MB compared to pyarrow's >100MB) library that can manage Arrow data in a compliant way between libraries, but with nicer high-level APIs than nanoarrow. It has wheels for every platform, including pyodide.
Well, I don't want to try and boil the ocean here, but I wonder whether, since we don't require pyarrow, we should look at requiring arro3 as a fallback. I think there's good value in having another library provide a consistent object like a RecordBatchReader for data exchange like this, and we could just accept that in our Series / DataFrame constructors rather than building it ourselves.
Well, I'd say the point of arro3 is to handle cases like this. But at the same time, being stable enough to be a required pandas dependency is a pretty high bar...
I'd say that in managing Arrow data, arro3 is relatively stable, but that in managing interop with pandas and numpy it's less stable.
We now have https://github.com/pandas-dev/pandas/pull/56587 and https://github.com/pandas-dev/pandas/issues/59518 for exporting pandas DataFrame and Series through the Arrow PyCapsule Interface (i.e. adding `__arrow_c_stream__` methods), but we don't yet have the import counterpart.

For importing, the specification doesn't provide any API guidelines on what this should look like, so we have a couple of options. The two main ones I can think of:

1. A `from_arrow()` method, which could be top level (`pd.from_arrow(..)`) or a class method (`pd.DataFrame.from_arrow(..)`)
2. Supporting such objects in the main constructor (`pd.DataFrame(..)`)

In pandas itself, we do have a couple of `from_..` class methods (`from_dict`/`from_records`), but often for objects we also allow in the main constructor (at least for the dict case). I think the main differentiator is that the specific class methods have more specialized keyword arguments (and therefore allow a larger variety of input). So based on that pattern, we could also do both: add a `DataFrame.from_arrow()` class method, and then also accept such objects in `pd.DataFrame()`, passing through to `from_arrow()` (which could have more custom options to control how exactly the conversion from arrow to pandas is done).

Looking at polars, it seems they also have both, but I am not entirely sure about the connection between the two. `pl.from_arrow` already existed but might be more specific to pyarrow? And then https://github.com/pola-rs/polars/pull/17693 added it to the main `pl.DataFrame(..)` constructor (@kylebarron).

For geopandas, I added a `GeoDataFrame.from_arrow()` method.

(To be clear, everything said above also applies to `Series()`/`Series.from_arrow()` etc.)

cc @MarcoGorelli @WillAyd
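A minimal sketch of what the constructor pass-through described above could look like internally (purely illustrative; the helper name and its placement are hypothetical, and it assumes pyarrow as the conversion engine):

```python
import pyarrow as pa

def _construct_from_arrow(data):
    # Hypothetical helper the DataFrame constructor could delegate to:
    # any object exporting __arrow_c_stream__ is converted via pyarrow.
    if not hasattr(data, "__arrow_c_stream__"):
        raise TypeError("object does not export an Arrow C stream")
    return pa.table(data).to_pandas()
```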