Upcoming changes in pyarrow 2.0

jorisvandenbossche commented 3 years ago

In the upcoming pyarrow 2.0 release (to be released in the coming days), there are a few changes that will impact how petastorm uses pyarrow. (Note: I am not a petastorm user, but a pyarrow developer. It was through https://github.com/uber/petastorm/issues/590, https://github.com/uber/petastorm/issues/604, and https://issues.apache.org/jira/browse/ARROW-10029, opened by @dmcguire81, that I took a brief look at petastorm and noticed that it is quite a heavy user of pyarrow, so I thought it would be worth opening this issue.)

The custom pyarrow serialize/deserialize functionality is deprecated (it provided a Python-specific custom serialization, not compatible with the cross-language IPC serialization format defined by Arrow). In general it is recommended to just use pickle instead, or, if dealing with pyarrow objects (Table, RecordBatch), the IPC functionality. It seems you are only/mostly using it in https://github.com/uber/petastorm/blob/2889a058e3e0844df01fec908c62c9fe80d04c2d/petastorm/reader_impl/pyarrow_serializer.py#L20-L43, as an alternative to pickle. For tables, you also have ArrowTableSerializer, but that is already using the IPC functionality as recommended. So I suppose you can use pickle instead, but if you have specific use cases or feedback on the deprecation, let us know.
The pyarrow.filesystem module is deprecated in favor of the new filesystem implementations in the pyarrow.fs module (those new filesystems are better integrated into Arrow's C++ core, but also have a different API). I didn't check in detail how you are using this (outside of the use of ParquetDataset, see below), so I don't know offhand what updates would be needed.
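
For the first item, a minimal sketch of what the pickle route could look like, assuming the serializer only needs a serialize/deserialize pair over plain Python objects. The class name `PickleSerializer` and the sample rows are hypothetical and not taken from petastorm:

```python
import pickle


class PickleSerializer:
    """Hypothetical stand-in for a serializer built on the deprecated
    pyarrow.serialize()/pyarrow.deserialize() functions."""

    def serialize(self, rows):
        # pickle handles arbitrary Python objects; the highest protocol
        # available is used to keep the payload compact.
        return pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL)

    def deserialize(self, serialized_rows):
        # Only unpickle data from a trusted source (e.g. your own workers).
        return pickle.loads(serialized_rows)


# Round-trip a small batch of rows (illustrative data only).
serializer = PickleSerializer()
rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
payload = serializer.serialize(rows)
restored = serializer.deserialize(payload)
```

This only covers generic Python objects; for pyarrow Table or RecordBatch objects the IPC functionality mentioned above remains the recommended path.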

In addition, we are also reimplementing pyarrow.parquet.ParquetDataset, but I will open a separate issue about that (see https://github.com/uber/petastorm/issues/613).

Happy to answer any questions / receive feedback about those changes!

selitvin commented 3 years ago

Thank you. Very much appreciate you proactively reaching out on these issues!

jorisvandenbossche commented 2 years ago

selitvin commented 2 years ago

Thanks for the pointer. #777 removes usage of the deprecated API. Thank you for the ping.
