toddfarmer / arrow-migration


[Python] Add serialization callbacks for pandas objects in pyarrow.serialize #855

Closed toddfarmer closed 6 years ago

toddfarmer commented 7 years ago

Note: This issue was originally created as ARROW-1503. Please see the migration documentation for further details.

Original Issue Description:

toddfarmer commented 7 years ago

Note: Comment by Wes McKinney (wesm): Moved this to 0.8.0. There is some work to do here: to handle pandas objects efficiently, pyarrow.serialize first needs to handle pyarrow.Buffer more efficiently.

toddfarmer commented 7 years ago

Note: Comment by Wes McKinney (wesm): [~pcmoritz] [~robertnishihara] What do you think about having a default serialization context in which we can register serializers for pandas (and some other things, too)?

toddfarmer commented 7 years ago

Note: Comment by Robert Nishihara (robertnishihara): I think that's a good idea, and it would make things a lot more convenient. This makes sense for pandas DataFrames. The types that we're currently registering in Ray (at https://github.com/ray-project/ray/blob/aebe9f937451bfa10aa0f2a41bafcf4747fb60f0/python/ray/worker.py#L1045-L1076) would probably make sense in the default serialization context as well: collections.OrderedDict, collections.defaultdict, and numpy arrays with non-nice dtypes.

We could also consider supporting PyTorch tensors and similar types here (without assuming that PyTorch is installed, of course), though that might make more sense in Ray than in pyarrow.
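
To make the "default serialization context" idea concrete, here is a minimal standard-library sketch of a registry that maps types to serializer/deserializer callbacks. It is not pyarrow's implementation: the class `SerializationContext`, the method `register_type`, the function `default_serialization_context`, and the JSON wire format are all assumptions chosen for illustration. Only `collections.OrderedDict` is registered, because a `defaultdict`'s `default_factory` would not survive this toy JSON encoding.

```python
import collections
import json


class SerializationContext:
    """Hypothetical registry of per-type serialization callbacks."""

    def __init__(self):
        self._serializers = {}    # type -> (type_id, serializer callback)
        self._deserializers = {}  # type_id -> deserializer callback

    def register_type(self, type_, type_id, custom_serializer, custom_deserializer):
        # The serializer turns an instance into JSON-friendly data;
        # the deserializer rebuilds the instance from that data.
        self._serializers[type_] = (type_id, custom_serializer)
        self._deserializers[type_id] = custom_deserializer

    def serialize(self, obj):
        entry = self._serializers.get(type(obj))
        if entry is None:
            # Unregistered types are passed to json as-is.
            payload = {"type_id": None, "data": obj}
        else:
            type_id, to_plain = entry
            payload = {"type_id": type_id, "data": to_plain(obj)}
        return json.dumps(payload).encode("utf-8")

    def deserialize(self, blob):
        payload = json.loads(blob.decode("utf-8"))
        type_id = payload["type_id"]
        if type_id is None:
            return payload["data"]
        return self._deserializers[type_id](payload["data"])


def default_serialization_context():
    """Build a context with a common type pre-registered (illustrative only)."""
    context = SerializationContext()
    context.register_type(
        collections.OrderedDict,
        "OrderedDict",
        # Store items as a list of pairs so that insertion order is explicit.
        custom_serializer=lambda d: list(d.items()),
        custom_deserializer=lambda items: collections.OrderedDict(
            (key, value) for key, value in items
        ),
    )
    return context


context = default_serialization_context()
original = collections.OrderedDict([("b", 1), ("a", 2)])
restored = context.deserialize(context.serialize(original))
```

The sketch looks up callbacks by exact type and only at the top level, so registered objects nested inside other containers are not handled. A pandas serializer could be registered in the same way if pandas happens to be importable, which keeps it an optional dependency.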

toddfarmer commented 6 years ago

Note: Comment by Philipp Moritz (pcmoritz): Issue resolved by pull request 1192 https://github.com/apache/arrow/pull/1192