The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0
BatchedDataLoader with shuffling_queue_capacity=0 is very slow #653
With shuffling_queue_capacity=0, BatchedDataLoader uses BatchedNoopShufflingBuffer as the underlying shuffling buffer implementation. The current implementation is very slow when a large batch of data is added to it, because it eagerly splits everything it holds into batches:

    while self._num_samples >= self.batch_size:
        self._make_batch()

_num_samples is typically a very large number, so with a small batch size this loop makes a huge number of _make_batch calls.
A better solution is to produce a batch only when one is requested, instead of paying the whole cost up front.
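
To make the suggestion concrete, here is a minimal sketch of a buffer that builds batches on demand. It is not Petastorm's code: the class name LazyBatchBuffer and its methods are hypothetical, chosen only for illustration, and it uses plain Python lists instead of tensors. Adding data just stores the chunk, and each retrieve() call slices out exactly one batch.

```python
from collections import deque


class LazyBatchBuffer:
    """Hypothetical buffer that slices batches on demand (not Petastorm's API)."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._chunks = deque()   # pending chunks of rows, in arrival order
        self._offset = 0         # read position inside the oldest chunk
        self._num_samples = 0    # rows buffered but not yet returned

    def add_many(self, rows):
        # Constant-time: store the chunk as-is, no batches are built here.
        if len(rows) > 0:
            self._chunks.append(rows)
            self._num_samples += len(rows)

    def can_retrieve(self):
        return self._num_samples >= self.batch_size

    def retrieve(self):
        # Build exactly one batch, consuming only as many rows as needed.
        if not self.can_retrieve():
            raise RuntimeError("not enough samples for a full batch")
        batch = []
        while len(batch) < self.batch_size:
            chunk = self._chunks[0]
            need = self.batch_size - len(batch)
            piece = chunk[self._offset:self._offset + need]
            batch.extend(piece)
            self._offset += len(piece)
            if self._offset == len(chunk):
                # Oldest chunk fully consumed: drop it and start on the next.
                self._chunks.popleft()
                self._offset = 0
        self._num_samples -= self.batch_size
        return batch
```

With this shape, the cost of adding a large chunk no longer depends on the batch size, and a consumer that stops early never pays for batches it did not request. A batch may span two chunks, which the loop in retrieve() handles by continuing into the next chunk.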