uber / petastorm

The Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Is there an easy way to transform a native Parquet file to a Petastorm dataset? #463

Closed: LiuxyEric closed this issue 4 years ago

LiuxyEric commented 4 years ago

Hi, I have a dataset saved as a native Parquet file. I used `make_batch_reader` to read it and transform it into a TensorFlow dataset, and everything works fine.

But with the native Parquet file format I can't control the batch size, so I have to generate lots of small Parquet partitions to reduce the number of training rows per batch so that they fit in GPU memory, which is really inefficient.

I was wondering: is there an easy way to transform a native Parquet file into a Petastorm dataset?
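Roughly, what I do today looks like this (just a sketch; the path is a placeholder):

```python
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# Read a directory of plain (non-Petastorm) Parquet files.
with make_batch_reader('file:///tmp/my_dataset.parquet') as reader:
    # Each element of this dataset is a whole batch whose size follows the
    # Parquet row-group / partition layout, not a batch size I choose.
    dataset = make_petastorm_dataset(reader)
    for batch in dataset:
        ...  # feed into the model; batch size == rows in the row group
```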

selitvin commented 4 years ago

Would using tensorflow's unbatch operation help in your case? See an example here: https://github.com/uber/petastorm/issues/359#issuecomment-483507379
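Something along these lines (untested sketch; the path and the batch size of 128 are placeholders, and on older TF versions you would use `dataset.apply(tf.data.experimental.unbatch())` instead of `.unbatch()`):

```python
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_batch_reader('file:///tmp/my_dataset.parquet') as reader:
    dataset = (
        make_petastorm_dataset(reader)
        .unbatch()   # split row-group-sized batches into individual rows
        .batch(128)  # re-batch to a size that fits in GPU memory
    )
    for batch in dataset:
        ...
```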

LiuxyEric commented 4 years ago

Thanks a lot! It's really helpful!