uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Performance benchmarks against HDF5 #358

Open georghildebrand opened 5 years ago

georghildebrand commented 5 years ago

Hi all, great work, and glad to see progress here. Is there a comparison anywhere between Petastorm and HDF5/bcolz/zarr?

selitvin commented 5 years ago

Hi. These formats look interesting. We have never evaluated their performance, but they do look promising, and it would be interesting to try them. Internally, we are experimenting with making Parquet just one of several alternative backend implementations for Petastorm. We could consider adding bcolz or zarr as another backend alternative.

georghildebrand commented 5 years ago

I think that would make a lot of sense for use cases from machine learning, algorithms, etc., as I currently see Parquet used mostly for relational data... maybe I am wrong in this assumption.

un-knight commented 5 years ago

And I would like to see a comparison with LMDB.