pandas-dev / pandas


ENH: support the Arrow PyCapsule Interface on pandas.Series (export) #59518

Open kylebarron opened 1 month ago

kylebarron commented 1 month ago

Problem Description

Similar to the existing DataFrame export in https://github.com/pandas-dev/pandas/pull/56587, I would like pandas to export Series via the Arrow PyCapsule Interface.

Feature Description

Implementation would be quite similar to https://github.com/pandas-dev/pandas/pull/56587. I'm not sure what the ideal API is to convert a pandas Series to a pyarrow Array/ChunkedArray.
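
For illustration, here is a minimal sketch of what the export could look like when delegating to pyarrow. The helper below is hypothetical (not the API from #56587) and assumes pyarrow >= 14, which added the PyCapsule protocol methods on ChunkedArray:

```python
import pandas as pd
import pyarrow as pa


def series_arrow_c_stream(ser: pd.Series, requested_schema=None):
    """Hypothetical helper: export a Series as an Arrow C stream capsule."""
    if isinstance(ser.dtype, pd.ArrowDtype):
        # pyarrow-backed dtypes already store a pyarrow.ChunkedArray
        ca = ser.array._pa_array
    else:
        # numpy-backed dtypes convert to a single contiguous Arrow array
        ca = pa.chunked_array([pa.Array.from_pandas(ser)])
    # delegate the capsule export to pyarrow
    return ca.__arrow_c_stream__(requested_schema)
```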

Alternative Solutions

Require users to wrap a Series in a DataFrame before passing it to an external consumer?

Additional Context

Pyarrow has implemented the PyCapsule Interface on ChunkedArray for the last couple of versions. Ref https://github.com/apache/arrow/pull/40818

If converting pandas objects to pyarrow always creates non-chunked Arrow objects, then perhaps we should implement __arrow_c_array__ instead?

cc @jorisvandenbossche as he implemented https://github.com/pandas-dev/pandas/pull/56587

jorisvandenbossche commented 1 month ago

Yes, this is on my to do list for the coming weeks! :)

If converting pandas objects to pyarrow always creates non-chunked Arrow objects, then perhaps we should implement __arrow_c_array__ instead?

For our numpy-based dtypes, this is indeed the case. But the problem is that for the pyarrow-based columns, we actually store chunked arrays.

So when implementing one protocol, it should be __arrow_c_stream__ I think. I am not entirely sure how helpful it would be to also implement __arrow_c_array__ and let that either only work if there is only one chunk (and error otherwise) or let that concatenate implicitly when there are multiple chunks.
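
To illustrate the trade-off with pyarrow directly (a sketch; a Series-level __arrow_c_array__ would face the same choice between erroring and concatenating):

```python
import pyarrow as pa

# a chunked array with two chunks, like a pyarrow-backed Series can hold
ca = pa.chunked_array([pa.array(["a"]), pa.array(["b"])])

# the stream protocol can hand over the chunks as-is (no concatenation)
stream_capsule = ca.__arrow_c_stream__()

# the array protocol needs one contiguous array, so a producer must either
# raise when ca.num_chunks != 1 or concatenate (which copies the data)
arr = pa.concat_arrays(ca.chunks)
schema_capsule, array_capsule = arr.__arrow_c_array__()
```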

kylebarron commented 1 month ago

So when implementing one protocol, it should be __arrow_c_stream__ I think.

Yes that makes sense and I'd agree. I'd suggest always exporting a stream.

WillAyd commented 1 month ago

Why wouldn't we support both methods? The fact that pyarrow arrays use chunked storage behind the scenes is an implementation detail, but for everything else (and from a front-end perspective) array makes more sense than array_stream.

WillAyd commented 1 month ago

N.B. I'm also not sure about the reasoning that went into using chunked array storage for pyarrow-backed types.

kylebarron commented 1 month ago

I think the question is: what should data consumers be able to infer based on the presence of one or more of these methods. See https://github.com/apache/arrow/issues/40648

jorisvandenbossche commented 3 weeks ago

There is still some discussion about using the array vs the stream interface, so reopening this issue.

@WillAyd 's comment from https://github.com/pandas-dev/pandas/pull/59587#issuecomment-2311278954:

To talk more practically, I am wondering about a scenario where we have a series that holds a chunked array. In this PR, we convert that to a singular array before then exposing it back as a stream, but the other point of view is that we could allow the stream to iterate over each chunk.

But then the question becomes: what happens when that same Series gets used in a DataFrame? Does the DataFrame iterate over its chunks? In most cases that seems unlikely to be possible (unless all other Series of the DataFrame share the same chunking), so you get a rather interesting scenario where iterating by DataFrame could potentially be far more expensive than iterating the Series.

The more I think through it I am leaning towards -1 on supporting the stream interface for the Series; I think we should just expose as an array for now

For the specific example of a Series with multiple chunks that is then put in a DataFrame: if you access that through the DataFrame's stream interface, you will also get multiple chunks.

In pyarrow, a Table is also not required to have equally chunked columns, but when converting to a stream of batches, it will merge the chunking of the different columns (by default, I think, in such a way that everything can be zero-copy slices of the original data, i.e. the smallest set of batches needed to achieve that for all columns).

Code example to illustrate:

```
In [21]: pd.options.future.infer_string = True

In [22]: ser = pd.concat([pd.Series(["a"]), pd.Series(["b"])], ignore_index=True)

In [23]: ser.array._pa_array
Out[23]:
[
  [
    "a"
  ],
  [
    "b"
  ]
]

In [24]: df = pd.DataFrame({"A": [1, 2], "B": ser})

In [25]: df
Out[25]:
   A  B
0  1  a
1  2  b

In [26]: table = pa.Table.from_pandas(df)

In [27]: table
Out[27]:
pyarrow.Table
A: int64
B: large_string
----
A: [[1,2]]
B: [["a"],["b"]]

In [28]: table["B"]
Out[28]:
[
  [
    "a"
  ],
  [
    "b"
  ]
]

In [29]: list(table.to_reader())
Out[29]:
[pyarrow.RecordBatch
 A: int64
 B: large_string
 ----
 A: [1]
 B: ["a"],
 pyarrow.RecordBatch
 A: int64
 B: large_string
 ----
 A: [2]
 B: ["b"]]
```

So in practice, while we are discussing the usage of the stream interface for Series, I think effectively it would be for both DataFrame and Series (or would there be a reason to handle those differently? Maybe for tabular data people generally expect to work with streams more, in contrast to single columns/arrays?)

WillAyd commented 3 weeks ago

@jorisvandenbossche that's super cool - I did not realize Arrow created RecordBatches in that fashion. And that's zero-copy for the integral array in your example?

In that case maybe we should export both the stream and array interfaces. I guess there's some ambiguity for a consumer about whether that is a zero-copy exchange or not, but maybe that doesn't matter (?)

WillAyd commented 3 weeks ago

Another consideration is how we want to act as a consumer of Arrow data, not just as a producer. If we push developers towards preferring stream data in this interface, the implication is that as a consumer we would need to prefer Arrow chunked array storage for our arrays.

I'm not sure that is a bad thing, but it's different from where we are today.

kylebarron commented 3 weeks ago

My general position is that PyCapsule Interface producers should not make it unintentionally easy to force memory copies. So if a pandas series could be either chunked or non-chunked, then only the C Stream interface should be implemented. Otherwise, implementing the C Array interface as well could allow consumers to unknowingly cause pandas to concatenate a series into an array.

IMO, the C Array interface should only be implemented if a pandas Series always stores a contiguous array under the hood. It's easy for consumers of a C stream to see whether the stream only holds a single array (assuming the consumer materializes the full stream).
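
For example, a consumer-side check might look like this (a sketch; it assumes a pyarrow version where pa.chunked_array() accepts arbitrary objects implementing __arrow_c_stream__):

```python
import pyarrow as pa


def stream_is_single_chunk(obj) -> bool:
    """Materialize an __arrow_c_stream__ producer and check its chunking."""
    ca = pa.chunked_array(obj)  # consumes the producer's C stream capsule
    return ca.num_chunks == 1
```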

jorisvandenbossche commented 3 weeks ago

And that's zero-copy for the integral array in your example?

AFAIK yes

the implication is that as a consumer we would need to prefer Arrow chunked array storage for our arrays.

That's already the case for dtypes backed by pyarrow, I think. If you read a large parquet file, pyarrow will read that into a table with chunks, and converting that to pandas using a pyarrow-backed dtype for one of the columns will preserve that chunking.
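
A sketch of that behaviour (the file and column names are hypothetical; dtype_backend="pyarrow" is the existing keyword on read_parquet):

```python
import pandas as pd

# hypothetical file; a large parquet file is typically read in multiple chunks
df = pd.read_parquet("data.parquet", dtype_backend="pyarrow")

# for a pyarrow-backed column, the underlying ChunkedArray keeps that chunking
print(df["some_column"].array._pa_array.num_chunks)
```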

WillAyd commented 3 weeks ago

My general position is that PyCapsule Interface producers should not make it unintentionally easy to force memory copies. So if a pandas series could be either chunked or non-chunked, then only the C Stream interface should be implemented.

Thanks, this is good input. I'm bought in!

If you read a large parquet file, pyarrow will read that into a table with chunks, and converting that to pandas using a pyarrow-backed dtype for one of the columns will preserve that chunking.

Cool! I don't think I voiced it well, but my major concern is around our usage of NumPy by default. For the I/O operation you described, I think the dtype_mapper will in many cases still end up copying that chunked data into a single array to fit into the NumPy view of the world.

Not trying to solve that here, but I think with I/O methods there is some control over how you want that conversion to happen. With this interface, I don't think we get that control? So if we produced data from a NumPy array, sent it to a consumer that then wants to send back new data for us to consume, wouldn't we be automatically going from NumPy to Arrow-backed?

jorisvandenbossche commented 3 weeks ago

So if we produced data from a NumPy array, sent it to a consumer that then wants to send back new data for us to consume, wouldn't we be automatically going from NumPy to Arrow-backed?

We don't yet have the consumption part in pandas (I am planning to look into that in the near future, but don't let that stop anyone from doing it already), but if we have a method to generally convert Arrow data to a pandas DataFrame, that could in theory have a keyword similar to other IO methods to determine which dtypes to use.
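
For comparison, a sketch of the dtype control that already exists today when converting Arrow data to pandas via pyarrow itself (not a pandas consumption API):

```python
import pandas as pd
import pyarrow as pa

table = pa.table({"a": [1, 2, 3]})

df_numpy = table.to_pandas()                            # numpy-backed dtypes
df_arrow = table.to_pandas(types_mapper=pd.ArrowDtype)  # pyarrow-backed dtypes
```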

WillAyd commented 3 weeks ago

Makes sense. So for where we are on main now I think we just need to make it so that the implementation doesn't always copy (?)

jorisvandenbossche commented 3 weeks ago

Yes, and fix requested_schema handling