The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can also be used from pure Python code.
Hi, I see that the reader class in this package replaces the Dataset class in PyTorch. Is there a method in the reader class similar to "__getitem__" in the Dataset class, so that I can do customized sampling for a batch? Thank you!
The current implementation is oriented towards streaming and does not support random sampling. This is a deliberate design choice, since reading individual records from Parquet is very inefficient.
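If approximate randomness is enough, one common workaround for streaming sources is to pass the stream through a shuffle buffer before batching. The sketch below is plain Python and is not Petastorm's API: `shuffle_buffer`, `batches`, and the `rows` stand-in are hypothetical names used only for illustration, and `rows` would be replaced by whatever iterable your reader exposes.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_buffer(stream: Iterable[T], capacity: int, seed: int = 0) -> Iterator[T]:
    """Yield items from `stream` in approximately random order.

    Keeps at most `capacity` items in memory. Once the buffer is full,
    each incoming item causes one randomly chosen buffered item to be emitted.
    """
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= capacity:
            # Swap a random element to the end so it can be popped in O(1).
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # The stream is exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


def batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream into lists of `batch_size` items (the last may be shorter)."""
    batch: List[T] = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Stand-in for rows streamed sequentially by a reader (assumption for the demo).
rows = range(20)
result = list(batches(shuffle_buffer(rows, capacity=8), batch_size=4))
```

A larger `capacity` gives better mixing at the cost of memory. This does not give true per-index sampling like "__getitem__" does; it only decorrelates neighbouring rows while the data is still read sequentially.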