The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0
1.8k stars, 284 forks
Support inter-row-group shuffling queue when reading from pytorch #382
Added shuffling_queue_capacity and min_after_dequeue arguments to the petastorm.pytorch.DataLoader class. When these values are set, an in-memory buffer is used to accumulate and reshuffle the data being loaded.
The mechanism is not as flexible as PyTorch's sampling, which is hard to implement because it requires efficient random access to data in Parquet files.
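The general idea of such a buffer can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from this change: the function name `shuffle_through_buffer`, its `seed` parameter, and the exact fill/drain policy (fill up to `capacity`, then emit random rows until `min_after_dequeue` remain) are hypothetical choices made for the example.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def shuffle_through_buffer(
    rows: Iterable[T],
    capacity: int,
    min_after_dequeue: int,
    seed: Optional[int] = None,
) -> Iterator[T]:
    """Yield rows in a locally shuffled order using a bounded in-memory buffer.

    Hypothetical helper for illustration; it does not reproduce Petastorm's
    internal implementation.
    """
    if not 0 <= min_after_dequeue < capacity:
        raise ValueError("expected 0 <= min_after_dequeue < capacity")

    rng = random.Random(seed)
    buffer: List[T] = []

    def pop_random() -> T:
        # Swap a random element with the last one so removal is O(1).
        i = rng.randrange(len(buffer))
        buffer[i], buffer[-1] = buffer[-1], buffer[i]
        return buffer.pop()

    for row in rows:
        buffer.append(row)
        # Once the buffer is full, emit random rows, but always leave
        # min_after_dequeue rows behind so later rows can mix with them.
        if len(buffer) >= capacity:
            while len(buffer) > min_after_dequeue:
                yield pop_random()

    # The input is exhausted: the lower bound no longer applies, drain the rest.
    while buffer:
        yield pop_random()


if __name__ == "__main__":
    shuffled = list(
        shuffle_through_buffer(range(20), capacity=8, min_after_dequeue=3, seed=0)
    )
    print(shuffled)
```

In this sketch a row can only be emitted after it has entered the buffer, so the shuffle is local rather than global: a larger capacity and a larger min_after_dequeue mix rows from further apart at the cost of more memory.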