Closed by toddfarmer 7 years ago
Note: Comment by Vincent Pham (vincentpham): Hi, if this is not taken, I would love to contribute to Apache Arrow.
Note: Comment by Uwe Korn (uwe):
[~vincentpham]
Hello, this isn't taken yet. Feel free to start working on it. If you have questions, ask on the mailing list or in the Arrow Slack channel (https://apachearrowslackin.herokuapp.com/).
Note: Comment by Matthew Rocklin (mrocklin): I would love to see this issue get higher priority. I would like to experiment with using Arrow as Dask's network serialization format for pandas dataframes if it were implemented. I think we would see good speed boosts on communication heavy workloads like shuffles. This would be fun to write about afterwards.
Note: Comment by Wes McKinney (wesm): It is not difficult. I will mark it for the 0.3 release (this month).
Note: Comment by Wes McKinney (wesm):
[~ahnj]
If you don't mind, I will take care of this one. It requires a bit of work to expose the custom_metadata fields in the file metadata.
Note: Comment by Jim Ahn (ahnj): The implementation and test for the pandas-index-to-column conversion are done.
I'm seeking some guidance on the second portion of the task - "Additionally there should be some metadata annotation of that column that it is derived of an Pandas Index, so that for roundtrips, we'll use it again as the index of a DataFrame."
How does one attach a 'metadata annotation' to an Arrow column? A search of the repo did not reveal any obvious examples. Please provide some guidance. Thanks!
Note: Comment by Jim Ahn (ahnj): Oops. Wes, I did not see your comment until I posted mine. No worries. I'll move on to another task. Thanks.
Note: Comment by Matthew Rocklin (mrocklin):
"I think we would see good speed boosts on communication heavy workloads like shuffles."
I need to walk back from this statement a bit. I implemented a crude solution using straight numpy that works in simple cases to see what I could expect from a full Arrow solution. I did not see as much improvement as I expected. Still trying to identify my current bottleneck.
Note: Comment by Wes McKinney (wesm): This is in progress in ARROW-881 https://github.com/apache/arrow/pull/612
Note: Comment by Wes McKinney (wesm): Removing this from the release blockers. We can make 0.3.0.post artifacts if we want to get this out there before the 0.4 release.
Note: Comment by Wes McKinney (wesm): This was done in https://github.com/apache/arrow/commit/bed01974321d9d1edeae9e474bd9df020b42ea10
Note: This issue was originally created as ARROW-376. Please see the migration documentation for further details.
Original Issue Description:
Currently the indices of a Pandas DataFrame are totally ignored in the Pandas-to-Arrow conversion. We should add an option to also convert the index to an Arrow column if it is not a simple range index.
The condition for a simple index should be
isinstance(df.index, pd.RangeIndex) and (df.index._start == 0) and (df.index._stop == len(df.index)) and (df.index._step == 1)
In this case, we can always skip the index conversion. Otherwise, a new column in the Arrow table shall be created using the index's name as the name of the column. Additionally, there should be some metadata annotation on that column indicating that it is derived from a Pandas Index, so that for roundtrips we use it again as the index of the DataFrame.
Migrated issue participants:
Reporter: Uwe Korn (uwe) Assignee: Phillip Cloud (cpcloud)
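
For readers following the description above, here is a minimal pandas-only sketch of the proposed behaviour. It is not the code that was merged into Arrow. The helper names (is_simple_range_index, index_to_column, restore_index), the "index_columns" metadata key, and the "__index__" fallback column name are all hypothetical and chosen only for illustration. The sketch also assumes the public RangeIndex attributes start, stop, and step, which recent pandas versions provide in place of the underscore-prefixed ones quoted in the issue.

```python
import pandas as pd


def is_simple_range_index(index: pd.Index) -> bool:
    """True when the index is just 0, 1, ..., n-1 and carries no information."""
    return (
        isinstance(index, pd.RangeIndex)
        and index.start == 0
        and index.stop == len(index)
        and index.step == 1
    )


def index_to_column(df: pd.DataFrame):
    """Hypothetical helper: move a non-trivial index into a regular column.

    Returns the flattened frame plus a small metadata dict recording which
    column came from the index, so it can be restored on the way back.
    """
    if is_simple_range_index(df.index):
        # Nothing worth preserving: skip the index conversion entirely.
        return df, {"index_columns": []}
    # Assumption: unnamed indexes get a placeholder column name.
    name = df.index.name if df.index.name is not None else "__index__"
    flat = df.rename_axis(name).reset_index()
    return flat, {"index_columns": [name]}


def restore_index(df: pd.DataFrame, metadata: dict) -> pd.DataFrame:
    """Hypothetical helper: undo index_to_column using the recorded metadata."""
    cols = metadata["index_columns"]
    return df.set_index(cols) if cols else df


# Usage: a frame with a meaningful index round-trips through a flat table.
original = pd.DataFrame({"x": [1, 2, 3]}, index=pd.Index([10, 20, 30], name="key"))
flat, meta = index_to_column(original)
roundtripped = restore_index(flat, meta)
```

In this sketch the metadata lives next to the table rather than on the column itself. The comments above indicate that the actual fix stored the annotation through the custom_metadata fields in the file metadata.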