uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Support inter-row-group shuffling queue when reading from pytorch #382

Closed: selitvin closed this 5 years ago

selitvin commented 5 years ago

Added shuffling_queue_capacity and min_after_dequeue arguments to the petastorm.pytorch.DataLoader class. When these values are set, an in-memory buffer is used to accumulate and reshuffle the data being loaded.

The mechanism is not as flexible as PyTorch's sampling, which is hard to implement here because it would require efficient random access to data in Parquet files.
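
For readers unfamiliar with the idea, here is a minimal pure-Python sketch of a shuffling buffer of this kind. It is an illustration only, not the code from this PR: the function name shuffle_through_buffer and the seed argument are hypothetical, and the exact fill and drain policy is an assumption. Only the two parameter names, capacity and min_after_dequeue, mirror the arguments described above.

```python
import random


def shuffle_through_buffer(rows, capacity, min_after_dequeue, seed=None):
    """Yield rows in a locally shuffled order using a bounded in-memory buffer.

    Hypothetical helper for illustration; not part of petastorm.
    """
    if not 0 <= min_after_dequeue < capacity:
        raise ValueError("expected 0 <= min_after_dequeue < capacity")

    rng = random.Random(seed)
    buffer = []

    def pop_random():
        # Swap a random element with the last one so removal is O(1).
        idx = rng.randrange(len(buffer))
        buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
        return buffer.pop()

    for row in rows:
        buffer.append(row)
        # Assumed policy: once the buffer is full, emit random rows until
        # only min_after_dequeue rows remain, then go back to filling.
        if len(buffer) >= capacity:
            while len(buffer) > min_after_dequeue:
                yield pop_random()

    # The input is exhausted: drain whatever is left, still in random order.
    while buffer:
        yield pop_random()


if __name__ == "__main__":
    # Rows arrive in sequential order, as they would within a row group.
    shuffled = list(shuffle_through_buffer(range(20), capacity=8,
                                           min_after_dequeue=4, seed=0))
    print(shuffled)
```

A sketch like this also shows why the approach is less flexible than a true sampler: a row can only be emitted once it has been read into the buffer, so the randomness is local to a sliding window rather than global over the dataset. A larger capacity widens that window at the cost of memory.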