uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Upcoming changes in pyarrow 2.0 #612

Closed jorisvandenbossche closed 2 years ago

jorisvandenbossche commented 3 years ago

The upcoming pyarrow 2.0 release (due in the coming days) includes a few changes that will affect how petastorm uses pyarrow. (Note: I am not a petastorm user but a pyarrow developer. I took a brief look at petastorm because of https://github.com/uber/petastorm/issues/590, https://github.com/uber/petastorm/issues/604, and https://issues.apache.org/jira/browse/ARROW-10029, opened by @dmcguire81, and noticed that it is quite a heavy user of pyarrow, so I thought it would be worth opening this issue.)

In addition, we are also reimplementing pyarrow.parquet.ParquetDataset, but I will open a separate issue about that (see https://github.com/uber/petastorm/issues/613).

Happy to answer any questions / receive feedback about those changes!

selitvin commented 3 years ago

Thank you. Very much appreciate you proactively reaching out on these issues!

jorisvandenbossche commented 2 years ago

I see that most usage of pyarrow.serialize was removed in https://github.com/uber/petastorm/pull/617, but it seems that this file is still using deprecated functionality: https://github.com/uber/petastorm/blob/10e0fc8655150afdcf4dbd9ac67dcfb2ccbb18d9/petastorm/local_disk_arrow_table_cache.py (LocalDiskArrowTableCache)
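
For illustration, one way to move a disk cache off pyarrow.serialize is to write values with the standard-library pickle module instead (pyarrow Table objects support pickling). The sketch below is only a rough outline under that assumption: the `PickleDiskCache` class, its `get(key, fill_cache_func)` signature, and the file layout are hypothetical names chosen for this example, not petastorm's actual LocalDiskArrowTableCache code.

```python
import hashlib
import os
import pickle
import tempfile


class PickleDiskCache:
    """Hypothetical disk cache that stores values with pickle.

    Illustrative stand-in only; this is not petastorm's implementation.
    """

    def __init__(self, directory):
        self._directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path_for(self, key):
        # Hash the key so arbitrary strings map to safe file names.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self._directory, digest + ".pkl")

    def get(self, key, fill_cache_func):
        """Return the cached value for key, computing and storing it on a miss."""
        path = self._path_for(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)

        value = fill_cache_func()
        # Write to a temporary file first, then rename it into place, so a
        # reader never sees a partially written cache entry.
        fd, tmp_path = tempfile.mkstemp(dir=self._directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(value, f, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp_path, path)
        return value
```

A caller would use it as `cache.get("some-row-group-key", lambda: load_table())`, where `load_table` is whatever produces the value to cache. Since pickle can execute code on load, a cache like this should only read files it wrote itself.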

selitvin commented 2 years ago

Thanks for the pointer. #777 removes usage of the deprecated API. Thank you for the ping.