Closed jorisvandenbossche closed 2 years ago
Thank you. Very much appreciate you proactively reaching out on these issues!
I see that most usage of pyarrow.serialize was removed in https://github.com/uber/petastorm/pull/617, but it seems that this file is still using deprecated functionality: https://github.com/uber/petastorm/blob/10e0fc8655150afdcf4dbd9ac67dcfb2ccbb18d9/petastorm/local_disk_arrow_table_cache.py (LocalDiskArrowTableCache)
Thanks for the pointer. #777 removes usage of the deprecated API. Thank you for the ping.
In the upcoming pyarrow 2.0 release (expected within the next few days), there are a few changes that will affect how petastorm uses pyarrow. (Note: I am not a petastorm user but a pyarrow developer. It was through https://github.com/uber/petastorm/issues/590, https://github.com/uber/petastorm/issues/604 and https://issues.apache.org/jira/browse/ARROW-10029, opened by @dmcguire81, that I took a brief look at petastorm and noticed that it is quite a heavy user of pyarrow, so I thought it would be worth opening this issue.)
The custom pyarrow serialize/deserialize functionality is deprecated (it provided a Python-specific custom serialization, not compatible with the cross-language IPC serialization format defined by Arrow). In general it is recommended to just use pickle instead or, when dealing with pyarrow objects (Table, RecordBatch), the IPC functionality. It seems you are only/mostly using it in https://github.com/uber/petastorm/blob/2889a058e3e0844df01fec908c62c9fe80d04c2d/petastorm/reader_impl/pyarrow_serializer.py#L20-L43, as an alternative to pickle. For tables, you also have ArrowTableSerializer, but that is already using the IPC functionality as recommended. So I suppose you can use pickle instead, but if you have specific use cases or feedback on the deprecation, let us know.
The pyarrow.filesystem module is deprecated in favor of the new filesystem implementations in the pyarrow.fs module (those new filesystems are better integrated into Arrow's C++ core, but also have a different API). I didn't check in detail how you are using this (outside of the use of ParquetDataset, see below), so I don't know offhand what updates would be needed.
In addition, we are also reimplementing pyarrow.parquet.ParquetDataset, but I will open a separate issue about that (-> https://github.com/uber/petastorm/issues/613).
Happy to answer any questions / receive feedback about those changes!