The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0
Is there an easy way to transform a native Parquet file into a Petastorm dataset? #463
Hi, I have a dataset saved as a native Parquet file. I use "make_batch_reader" to read it and convert it to a TensorFlow dataset, and everything works fine.
But with the native Parquet format I can't control the batch size. To make each training batch small enough to fit in GPU memory, I have to split the data into many Parquet partitions, which is really inefficient.
Is there an easy way to convert a native Parquet file into a Petastorm dataset?
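To make the batch-size problem concrete, here is a minimal sketch of decoupling the training batch size from the size of the chunks a reader hands back. It is not Petastorm's API: the `rebatch` helper and the dict-of-lists chunk layout are assumptions made for illustration, standing in for whatever columnar batches the reader actually yields.

```python
from typing import Dict, Iterable, Iterator


def rebatch(chunks: Iterable[Dict[str, list]], batch_size: int) -> Iterator[Dict[str, list]]:
    """Re-slice variable-sized columnar chunks into fixed-size batches.

    Hypothetical helper: each chunk maps a column name to a list of values,
    and all columns in a chunk have the same length.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    buffer: Dict[str, list] = {}
    for chunk in chunks:
        # Append the incoming rows to the per-column buffer.
        for name, values in chunk.items():
            buffer.setdefault(name, []).extend(values)

        # Emit as many full batches as the buffer currently holds.
        while buffer and len(next(iter(buffer.values()))) >= batch_size:
            yield {name: values[:batch_size] for name, values in buffer.items()}
            buffer = {name: values[batch_size:] for name, values in buffer.items()}

    # Emit the leftover rows as a final, smaller batch.
    if buffer and len(next(iter(buffer.values()))) > 0:
        yield dict(buffer)


# Two chunks of 5 and 3 rows, re-sliced into batches of at most 3 rows.
chunks = [{"x": [0, 1, 2, 3, 4]}, {"x": [5, 6, 7]}]
for batch in rebatch(chunks, batch_size=3):
    print(len(batch["x"]), batch)
```

In a tf.data pipeline the analogous re-slicing is usually written as `dataset.unbatch().batch(batch_size)`. Whether that fits the reader output in this case is an assumption, not something confirmed in this thread.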