uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Regarding performance of different read dataset methods #379

Status: Closed (panfengfeng closed this issue 5 years ago)

panfengfeng commented 5 years ago

I use the make_petastorm_dataset API (make_petastorm_dataset(reader)) to scan the dataset, and I found that it performs much worse than the reader loop method (for sample in reader). Do you have any ideas about why the make_petastorm_dataset API is slower?

selitvin commented 5 years ago

Can you please provide a little more information about your setup? Are you using the make_reader or make_batch_reader function to create your reader? Can you show a code snippet that represents how you do the scanning?
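
For reference, a comparison of the two scanning approaches discussed here might look roughly like the sketch below. It is not the original poster's code. It assumes a reader created with make_reader, TensorFlow 2.x in eager mode, and a hypothetical dataset_url pointing at a Petastorm dataset. The helper names measure_throughput and compare_read_methods are made up for illustration.

```python
import time
from itertools import islice


def measure_throughput(iterable, max_items=1000):
    """Iterate over up to max_items elements; return (count, items_per_second)."""
    start = time.perf_counter()
    count = sum(1 for _ in islice(iterable, max_items))
    elapsed = time.perf_counter() - start
    return count, (count / elapsed if elapsed > 0 else float("inf"))


def compare_read_methods(dataset_url, max_items=1000):
    """Time a plain reader loop against make_petastorm_dataset.

    Assumption: petastorm and TensorFlow 2.x are installed, and dataset_url
    (hypothetical) points at a dataset readable by make_reader.
    """
    # Imported lazily so measure_throughput can be used without these packages.
    from petastorm import make_reader
    from petastorm.tf_utils import make_petastorm_dataset

    # Method 1: plain Python loop over the reader (for sample in reader).
    with make_reader(dataset_url, num_epochs=1) as reader:
        loop_result = measure_throughput(reader, max_items)

    # Method 2: wrap a fresh reader in a tf.data.Dataset and iterate it eagerly.
    with make_reader(dataset_url, num_epochs=1) as reader:
        dataset = make_petastorm_dataset(reader)
        tf_result = measure_throughput(dataset, max_items)

    return {"reader_loop": loop_result, "make_petastorm_dataset": tf_result}
```

Posting numbers from something like compare_read_methods, together with the reader type and its arguments, would make it easier to tell whether the slowdown comes from the reader itself or from the tf.data wrapping.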