v01dXYZ closed this issue 3 years ago
I would support upgrading the minimum requirement to pyarrow > 2.0. I need to get back to that PR and revive it (or I can review contributions). In my current day job I am not using petastorm, so it's hard for me to provide timely support. There are folks at Uber who do use and support petastorm in their systems.
The migration to the Dataset API is a little bit complicated because, contrary to ParquetDataset, there are some major differences:
- dataset.paths
- metadata and common_metadata are gone
- the partitions attribute: there is no way to query the partitioning of a dataset with pyarrow==2.0.0 (it is only available from 5.0.0 on)

Since ParquetDatasetV2 is quite a poor shim over the Dataset API, it may be worth implementing a PetastormDataset class instead of using this shim, to better fit our usage.
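To illustrate the partitioning point, here is a minimal sketch of one possible workaround: recovering partition keys and values from the file paths themselves. It assumes a hive-style directory layout (key=value directory names), and hive_partition_values is a hypothetical helper, not something petastorm or pyarrow provides.

```python
from pathlib import PurePosixPath


def hive_partition_values(file_paths, base_dir):
    """Collect {key: sorted values} from key=value directory segments.

    Hypothetical helper: assumes hive-style partition directories and
    POSIX-style paths that all live under base_dir.
    """
    partitions = {}
    base = PurePosixPath(base_dir)
    for file_path in file_paths:
        relative = PurePosixPath(file_path).relative_to(base)
        # Every parent directory of the form key=value is a partition level;
        # the last path component is the file name and is skipped.
        for segment in relative.parts[:-1]:
            key, sep, value = segment.partition("=")
            if sep:
                partitions.setdefault(key, set()).add(value)
    return {key: sorted(values) for key, values in partitions.items()}
```

The values come back as strings, since nothing in a path says whether year=2020 is an integer or a string; a real implementation would have to decide how to type them.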
There are other minor differences:
- schema+row group info instead of pyarrow discovering the parquet files.

Because of the partitioning blocking point, I think it won't be straightforward to implement, so I consider the issue closed.
BTW, it is strange and wrong to see that only a handful of people have helped you carry the maintenance burden while petastorm is likely used by many companies.
Ha! Thanks for the empathy :) Yeah, it would be great to see more help!
Hello, would you be interested in using the new Dataset API instead of the ParquetDataset class, which is deprecated (mainly use ds.parquet_dataset and get_fragments instead of pieces)? I ask because it would force the pyarrow dependency to be greater than 2.0. I've seen you tried to move from the deprecated pyarrow.filesystem to the new pyarrow.fs, but the Merge Request is still at a draft stage despite being several months old, indicating petastorm is trying to satisfy environments that can't afford a pyarrow update.

Also, since Uber ATG merged with Aurora, would the project still be internally used and maintained?