uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Dataset API and pyarrow>=2.0 #700

Closed v01dXYZ closed 3 years ago

v01dXYZ commented 3 years ago

Hello, would you be interested in using the new Dataset API instead of the deprecated ParquetDataset class (mainly, using ds.parquet_dataset and get_fragments instead of pieces)? I ask because it would require pyarrow >= 2.0 as a dependency. I've seen that you tried to move from the deprecated pyarrow.filesystem to the new pyarrow.fs, but the Merge Request is still a draft despite being several months old, which suggests petastorm is trying to support environments that can't afford a pyarrow update.

Also, since Uber ATG merged with Aurora, will the project still be used and maintained internally?

selitvin commented 3 years ago

I would support upgrading the minimum requirement to pyarrow > 2.0. I need to get back and revive that PR (or I can review contributions). In my current day job I am not using petastorm, so it's hard for me to provide timely support. There are folks at Uber who do use and support petastorm in their systems.

v01dXYZ commented 3 years ago

The migration to the Dataset API is a bit complicated because it differs from ParquetDataset in some major ways:

Since ParquetDatasetV2 is quite a poor shim over the Dataset API, it may be worth implementing a PetastormDataset class that fits our usage better instead of relying on this shim.

There are other minor differences:

Because partitioning is a blocking point, I don't think this will be straightforward to implement, so I consider the issue closed.

BTW, it is strange and wrong to see that only a handful of people have helped you carry the maintenance burden while petastorm is likely used by many companies.

selitvin commented 3 years ago

> BTW, it is strange and wrong to see that only a handful of people have helped you carry the maintenance burden while petastorm is likely used by many companies.

Ha! Thanks for the empathy :) Yeah, it would be great to see more help!