uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

BatchedDataLoader with shuffling_queue_capacity=0 is very slow #653

Open selitvin opened 3 years ago

selitvin commented 3 years ago

With shuffling_queue_capacity=0, BatchedDataLoader uses BatchedNoopShufflingBuffer as the underlying shuffling buffer implementation. The current implementation is very slow when a large batch of data is added to it, because it immediately tries to run

        while self._num_samples >= self.batch_size:
            self._make_batch()

We typically end up with _num_samples being a very large number. With a small batch size, this loop therefore makes a huge number of _make_batch calls (roughly _num_samples / batch_size) before any batch is consumed.
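To give a feel for the scale, here is a small counting sketch. It assumes that each `_make_batch` call consumes exactly `batch_size` samples, which the loop condition suggests but the snippet above does not show. The function name and the sizes are made up for illustration; they are not measurements from petastorm.

```python
def eager_make_batch_calls(num_samples, batch_size):
    """Count iterations of a loop shaped like `while num_samples >= batch_size`."""
    calls = 0
    while num_samples >= batch_size:
        # Assumption: one _make_batch call removes batch_size samples.
        num_samples -= batch_size
        calls += 1
    return calls


# Illustrative sizes only: one large chunk of rows and a small batch size.
print(eager_make_batch_calls(1_000_000, 16))  # 62500
```

All of these calls would happen inside the single add, before the consumer has asked for even one batch.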

A better solution is to produce a batch lazily, each time one is requested, rather than paying the whole cost up front.
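A minimal sketch of that lazy approach follows. It is not petastorm's code: `LazyNoopBatchBuffer` is a hypothetical class, plain Python lists stand in for whatever tensor type the real buffer holds, and the method names (`add_many`, `can_retrieve`, `retrieve`) are chosen for illustration and may not match the library's interface.

```python
from collections import deque


class LazyNoopBatchBuffer:
    """Hypothetical non-shuffling buffer that builds batches on demand."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._chunks = deque()  # chunks in arrival order, stored by reference
        self._offset = 0        # read position inside the oldest chunk
        self._num_samples = 0   # samples added but not yet retrieved

    def add_many(self, items):
        # Cheap regardless of chunk size: nothing is sliced or copied here.
        if len(items) > 0:
            self._chunks.append(items)
            self._num_samples += len(items)

    def can_retrieve(self):
        return self._num_samples >= self.batch_size

    def retrieve(self):
        # Build exactly one batch, only when a caller asks for it.
        if not self.can_retrieve():
            raise RuntimeError("not enough samples for a full batch")
        batch = []
        while len(batch) < self.batch_size:
            head = self._chunks[0]
            take = min(self.batch_size - len(batch), len(head) - self._offset)
            batch.extend(head[self._offset:self._offset + take])
            self._offset += take
            if self._offset == len(head):
                # Oldest chunk is fully consumed; move on to the next one.
                self._chunks.popleft()
                self._offset = 0
        self._num_samples -= self.batch_size
        return batch


buf = LazyNoopBatchBuffer(batch_size=3)
buf.add_many(list(range(7)))  # no batches are built at this point
print(buf.retrieve())         # [0, 1, 2]
print(buf.retrieve())         # [3, 4, 5]
```

The total work is the same as in the eager version, but it is spread across `retrieve` calls, so adding a large chunk no longer stalls on building every batch at once. The leftover sample (`6`) stays in the buffer until more data arrives.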