kylebarron opened this issue 1 month ago
Yes, this is on my to-do list for the coming weeks! :)
> If converting pandas objects to pyarrow always creates non-chunked arrow objects, then perhaps we should implement `__arrow_c_array__` instead?
For our numpy-based dtypes, this is indeed the case. But the problem is that for the pyarrow-based columns, we actually store chunked arrays.
So when implementing one protocol, it should be `__arrow_c_stream__`, I think. I am not entirely sure how helpful it would be to also implement `__arrow_c_array__` and let that either work only when there is a single chunk (and error otherwise) or concatenate implicitly when there are multiple chunks.
> So when implementing one protocol, it should be `__arrow_c_stream__`, I think.
Yes that makes sense and I'd agree. I'd suggest always exporting a stream.
Why wouldn't we support both methods? The fact that pyarrow arrays use chunked storage behind the scenes is an implementation detail, but for everything else (and from a front-end perspective) `array` makes more sense than `array_stream`.
N.B. I'm also not sure what reasoning went into our use of chunked array storage for pyarrow-backed types.
I think the question is: what should data consumers be able to infer based on the presence of one or more of these methods? See https://github.com/apache/arrow/issues/40648. If an object implements `__arrow_c_array__`, then the object's internal storage should already be contiguous. You shouldn't implement `__arrow_c_array__` if calling it would (even sometimes) require a concatenation to happen. So pandas should implement `__arrow_c_stream__`, where the stream will sometimes include only a single array, but where it can be zero-copy as much as possible (in the pandas case: always, except for non-arrow storage).

There is still some discussion about using the array vs. stream interface, so reopening this issue.
@WillAyd 's comment from https://github.com/pandas-dev/pandas/pull/59587#issuecomment-2311278954:
> To talk more practically, I am wondering about a scenario where we have a Series that holds a chunked array. In this PR, we convert that to a single array before exposing it back as a stream, but the other point of view is that we could allow the stream to iterate over each chunk.
>
> But then the question becomes: what happens when that same Series gets used in a DataFrame? Does the DataFrame iterate its chunks? In most cases this seems highly unlikely to be possible (unless all other Series of the DataFrame share the same chunk sizes), so you get a rather interesting scenario where iterating by DataFrame could be far more expensive than the Series iteration.
>
> The more I think it through, the more I lean towards -1 on supporting the stream interface for Series; I think we should just expose an array for now.
For the specific example of a Series with multiple chunks that is then put in a DataFrame: if you access it through the DataFrame's stream interface, you will also get multiple chunks.
In pyarrow, a Table is also not required to have equally chunked columns, but when converting to a stream of batches, it will merge the chunking of the different columns (by default in such a way that everything can be zero-copy slices of the original data, i.e. the smallest batches needed to achieve that for all columns).
Code example to illustrate:
So in practice, while we are discussing the usage of the stream interface for Series, I think it would effectively apply to both DataFrame and Series (or would there be a reason to handle them differently? Maybe for tabular data people more generally expect to work with streams, in contrast to single columns/arrays?)
@jorisvandenbossche that's super cool - I did not realize Arrow created RecordBatches in that fashion. And that's zero-copy for the integral array in your example?
In that case then maybe we should export both the stream and array interfaces. I guess there's some ambiguity as a consumer about whether that is a zero-copy exchange or not, but maybe that doesn't matter (?)
Another consideration point is how we want to act as consumers of Arrow data, not just as a producer. If we push developers towards preferring stream data in this interface, the implication is that as a consumer we would need to prefer Arrow chunked array storage for our arrays.
I'm not sure that is a bad thing, but it's different from where we are today.
My general position is that PyCapsule Interface producers should not make it unintentionally easy to force memory copies. So if a pandas series could be either chunked or non-chunked, then only the C Stream interface should be implemented. Otherwise, implementing the C Array interface as well could allow consumers to unknowingly cause pandas to concatenate a series into an array.
IMO, the C Array interface should only be implemented if a pandas Series always stores a contiguous array under the hood. It's easy for consumers of a C stream to see if the stream only holds a single array (assuming the consumer materializes the full stream)
> And that's zero-copy for the integral array in your example?

AFAIK yes
> the implication is that as a consumer we would need to prefer Arrow chunked array storage for our arrays.
That's already the case for dtypes backed by pyarrow, I think. If you read a large parquet file, pyarrow will read that into a table with chunks, and converting that to pandas using a pyarrow-backed dtype for one of the columns will preserve that chunking.
> My general position is that PyCapsule Interface producers should not make it unintentionally easy to force memory copies. So if a pandas series could be either chunked or non-chunked, then only the C Stream interface should be implemented.
Thanks, this is good input. I'm bought in!
> If you read a large parquet file, pyarrow will read that into a table with chunks, and converting that to pandas using a pyarrow-backed dtype for one of the columns will preserve that chunking.
Cool! I don't think I voiced it well, but my major concern is around our usage of NumPy by default. For the I/O operation you described, I think the dtype_mapper will in many cases still end up copying that chunked data into a single array to fit into the NumPy view of the world.
Not trying to solve that here, but I think with I/O methods there is some control over how you want that conversion to happen. With this interface, I don't think we get that control? So if we produced data from a NumPy array and sent it to a consumer that then wants to send new data back for us to consume, wouldn't we automatically end up going from NumPy-backed to Arrow-backed?
> So if we produced data from a NumPy array, sent it to a consumer that then wants to send back new data for us to consume, wouldn't we be automatically going from NumPy to Arrow-backed?
We don't yet have the consumption part in pandas (I am planning to look into that in the near future, but don't let that stop anyone else from working on it already), but if we have a method to generally convert Arrow data to a pandas DataFrame, it could in theory take a keyword similar to other IO methods to determine which dtypes to use.
Makes sense. So for where we are on main now I think we just need to make it so that the implementation doesn't always copy (?)
Yes, and fix `requested_schema` handling.
Problem Description
Similar to existing DataFrame export in https://github.com/pandas-dev/pandas/pull/56587, I would like Pandas to export Series via the Arrow PyCapsule Interface.
Feature Description
Implementation would be quite similar to https://github.com/pandas-dev/pandas/pull/56587. I'm not sure what the ideal API is to convert a pandas Series to a pyarrow Array/ChunkedArray.
Alternative Solutions
Require users to put the Series into a DataFrame before passing it to an external consumer?
Additional Context
Pyarrow has implemented the PyCapsule Interface on ChunkedArrays for the last couple of versions. Ref https://github.com/apache/arrow/pull/40818
If converting pandas objects to pyarrow always creates non-chunked arrow objects, then perhaps we should implement `__arrow_c_array__` instead?

cc @jorisvandenbossche as he implemented https://github.com/pandas-dev/pandas/pull/56587