toddfarmer / arrow-migration


Python: Convert non-range Pandas indices (optionally) to Arrow #328

Closed toddfarmer closed 7 years ago

toddfarmer commented 8 years ago

Note: This issue was originally created as ARROW-376. Please see the migration documentation for further details.

Original Issue Description:

Currently the index of a Pandas DataFrame is ignored entirely during the Pandas to Arrow conversion. We should add an option to also convert the index to an Arrow column if it is not a simple range index.

The condition for a simple index should be isinstance(df.index, pd.RangeIndex) and (df.index._start == 0) and (df.index._stop == len(df.index)) and (df.index._step == 1). In this case, we can always skip the index conversion. Otherwise, a new column shall be created in the Arrow table, using the index's name as the name of the column. Additionally, there should be some metadata annotation on that column indicating that it is derived from a Pandas Index, so that for roundtrips, we'll use it again as the index of a DataFrame.
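The check and the fallback described above can be sketched in plain pandas. This is only an illustration, not the implementation that was eventually merged: the helper names `is_simple_range_index` and `index_to_columns` are hypothetical, and the sketch assumes a recent pandas in which `RangeIndex` exposes the public `start`, `stop`, and `step` attributes in place of the private `_start`, `_stop`, and `_step` named in the description. The list of index column names returned by the second helper stands in for the metadata annotation; how that annotation is stored on the Arrow side is not covered here.

```python
import pandas as pd


def is_simple_range_index(index):
    """Return True if `index` is the default 0..n-1 RangeIndex."""
    return (
        isinstance(index, pd.RangeIndex)
        and index.start == 0
        and index.stop == len(index)
        and index.step == 1
    )


def index_to_columns(df):
    """Hypothetical helper: move a non-trivial index into ordinary columns.

    Returns the (possibly modified) frame and the names of the columns
    that came from the index. A conversion layer could record those names
    as metadata so the index can be restored on the way back.
    """
    if is_simple_range_index(df.index):
        # Nothing to preserve: the index can be regenerated from the length.
        return df, []
    out = df.reset_index()
    # reset_index() places the former index levels first, in order.
    index_columns = list(out.columns[: df.index.nlevels])
    return out, index_columns


df = pd.DataFrame(
    {"x": [1, 2, 3]},
    index=pd.Index(["a", "b", "c"], name="key"),
)
flat, index_columns = index_to_columns(df)

# Roundtrip: use the recorded names to rebuild the original index.
restored = flat.set_index(index_columns)
```

Note that an unnamed index does not roundtrip exactly in this sketch, because `reset_index()` assigns it a default column name such as `index`.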

Migrated issue participants:

Reporter: Uwe Korn (uwe) Assignee: Phillip Cloud (cpcloud)

toddfarmer commented 7 years ago

Note: Comment by Vincent Pham (vincentpham): Hi, if this is not taken, I would love to contribute to Apache Arrow.

toddfarmer commented 7 years ago

Note: Comment by Uwe Korn (uwe): [~vincentpham] Hello, this isn't taken yet. Feel free to start working on it. If you have questions, ask on the mailing list or in the Arrow slack channel (https://apachearrowslackin.herokuapp.com/).

toddfarmer commented 7 years ago

Note: Comment by Matthew Rocklin (mrocklin): I would love to see this issue get higher priority. I would like to experiment with using Arrow as Dask's network serialization format for pandas dataframes if it were implemented. I think we would see good speed boosts on communication heavy workloads like shuffles. This would be fun to write about afterwards.

toddfarmer commented 7 years ago

Note: Comment by Wes McKinney (wesm): It is not difficult. I will mark it for the 0.3 release (this month).

toddfarmer commented 7 years ago

Note: Comment by Wes McKinney (wesm): [~ahnj] If you don't mind, I will take care of this one. It requires a bit of work to expose the custom_metadata fields in the file metadata.

toddfarmer commented 7 years ago

Note: Comment by Jim Ahn (ahnj): The implementation and test for the pandas-index-to-column conversion are done.

I'm seeking some guidance on the second portion of the task: "Additionally, there should be some metadata annotation on that column indicating that it is derived from a Pandas Index, so that for roundtrips, we'll use it again as the index of a DataFrame."

How does one attach a 'metadata annotation' to an Arrow column? A search of the repo did not reveal any obvious examples. Please provide some guidance. Thanks!

toddfarmer commented 7 years ago

Note: Comment by Jim Ahn (ahnj): Oops. Wes, I did not see your comment until I posted mine. No worries. I'll move on to another task. Thanks.

toddfarmer commented 7 years ago

Note: Comment by Matthew Rocklin (mrocklin):

"I think we would see good speed boosts on communication heavy workloads like shuffles."

I need to walk back from this statement a bit. I implemented a crude solution using straight numpy that works in simple cases to see what I could expect from a full Arrow solution. I did not see as much improvement as I expected. Still trying to identify my current bottleneck.

toddfarmer commented 7 years ago

Note: Comment by Wes McKinney (wesm): This is in progress in ARROW-881 https://github.com/apache/arrow/pull/612

toddfarmer commented 7 years ago

Note: Comment by Wes McKinney (wesm): Removing this from release blocker. We can make 0.3.0.post artifacts if we want to get this out there before the 0.4 release

toddfarmer commented 7 years ago

Note: Comment by Wes McKinney (wesm): This was done in https://github.com/apache/arrow/commit/bed01974321d9d1edeae9e474bd9df020b42ea10