uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Customized dataset #780

Closed JiajianLu closed 1 year ago

JiajianLu commented 1 year ago

Hi, I see that the reader class in this package replaces the dataset class in PyTorch. Is there a method in the reader class similar to `__getitem__` in the dataset class that would support customized sampling for a batch? Thank you!

selitvin commented 1 year ago

The current implementation is oriented towards streaming and does not support random sampling. This is a deliberate design choice, since reading individual records from Parquet is very inefficient.
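
For anyone who needs some randomness on top of a streaming reader, a common workaround is a bounded shuffle buffer: rows are consumed in stream order, but each emitted row is picked at random from a small in-memory window. The sketch below is plain Python and is not Petastorm's API; `shuffle_buffer` and `batches` are hypothetical helper names, and the `range(100)` stream stands in for whatever iterable of rows the reader yields. It only approximates random sampling and does not give index-based access the way `__getitem__` does.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_buffer(stream: Iterable[T], capacity: int, seed: int = 0) -> Iterator[T]:
    """Yield items from `stream` in a locally shuffled order.

    Hypothetical helper: keeps at most `capacity` items in memory and,
    once the buffer is full, emits one randomly chosen item per new item.
    """
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= capacity:
            # Swap a random element to the end, then pop it in O(1).
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Stream exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream into lists of `batch_size` items (last one may be shorter)."""
    batch: List[T] = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    rows = range(100)  # stand-in for rows streamed from a reader
    for batch in batches(shuffle_buffer(rows, capacity=16), batch_size=32):
        print(len(batch), batch[:5])
```

A larger `capacity` gives a better shuffle at the cost of memory. The rows are still read sequentially, so the access pattern stays compatible with the streaming design described above.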