uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Implementing Asynchronous Data Shuffling Part #638

Open chongxiaoc opened 3 years ago

chongxiaoc commented 3 years ago

The existing dataloader implementation is sequential; in the common use case, this limits training speed whenever the data shuffling step is slow. This happens when the user configures a fairly large shuffling queue capacity: every time the dataloader reads the next row group, the shuffling queue buffer has to compress and reshuffle its contents.

The idea behind an asynchronous dataloader is to move Parquet file reading, shuffling, and batch production into an asynchronous thread, with a batch queue shared between the consumer (training thread) and the producer (dataloader thread). This way, while the main thread runs the training and backpropagation stages, the asynchronous thread keeps generating batches, overlapping the two costs.
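As a minimal illustration of this producer/consumer split (not petastorm code; `produce_batches` and `train_step` are hypothetical placeholders for the real read/shuffle/batch pipeline and the training step):

```python
import queue
import threading

def produce_batches(batch_queue, num_batches=10):
    """Producer (dataloader thread): keeps generating batches into the queue."""
    for i in range(num_batches):
        batch = {"batch_id": i}       # placeholder for a real tensor batch
        batch_queue.put(batch)        # blocks when the queue is full (backpressure)
    batch_queue.put(None)             # sentinel: no more batches

def train_step(batch):
    """Consumer (training thread): forward/backward pass placeholder."""
    pass

def train():
    batch_queue = queue.Queue(maxsize=4)   # bounded queue between the two threads
    producer = threading.Thread(target=produce_batches, args=(batch_queue,), daemon=True)
    producer.start()
    while True:
        batch = batch_queue.get()          # overlaps with the producer refilling the queue
        if batch is None:
            break
        train_step(batch)
    producer.join()

if __name__ == "__main__":
    train()
```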

We have a prototype of an asynchronous dataloader based on existing work. Looking at the current code base, there are two ways to introduce it:

  1. Add an `is_async` option to the `LoaderBase` class, along with the other variables needed for asynchronous operation. https://github.com/uber/petastorm/blob/cf1159dc04416ed737eec25bcecef5d5aafa805a/petastorm/pytorch.py#L132 This would put all the async-related logic into the existing dataloader, for both the non-batched and batched versions, which would make the dataloader implementation complicated.

  2. Create a new `AsyncLoaderBase`, and implement `AsyncDataLoader` and `AsyncBatchedDataLoader` classes on top of it. In this case the user has to explicitly select a sync or async dataloader (a rough sketch follows below).
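For illustration, here is what option 2 could look like. The class name `AsyncLoaderBase` comes from the proposal above, but the implementation is purely hypothetical (a thin wrapper that drains an existing synchronous loader from a background thread), not petastorm's actual API:

```python
import queue
import threading

class AsyncLoaderBase:
    """Hypothetical sketch of option 2: wrap a synchronous loader and drain it
    from a background thread into a bounded batch queue."""

    def __init__(self, sync_loader, queue_size=4):
        self._sync_loader = sync_loader            # e.g. an existing (Batched)DataLoader
        self._queue = queue.Queue(maxsize=queue_size)
        self._thread = None

    def _producer(self):
        try:
            for batch in self._sync_loader:        # read + shuffle + batch happen here
                self._queue.put(batch)
        finally:
            self._queue.put(None)                  # sentinel marks end of epoch

    def __iter__(self):
        self._thread = threading.Thread(target=self._producer, daemon=True)
        self._thread.start()
        while True:
            batch = self._queue.get()
            if batch is None:
                break
            yield batch
        self._thread.join()
```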

@selitvin What do you think about the above ideas? And what are your suggestions for merging this work into petastorm going forward?

FYI @tgaddair

chongxiaoc commented 3 years ago

Had a discussion offline; starting with moving the bottleneck, the data shuffling part, into an asynchronous thread. https://github.com/uber/petastorm/blob/cf1159dc04416ed737eec25bcecef5d5aafa805a/petastorm/pytorch.py#L373

Then we will see how much of the latency it can hide/overlap.
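As a rough sketch of that direction (purely hypothetical, not the actual pytorch.py code): a shuffling buffer filled from a background thread, so the reshuffle cost overlaps with training:

```python
import queue
import random
import threading

class AsyncShufflingBuffer:
    """Hypothetical sketch: fill a shuffling buffer from a background thread so
    the reshuffle cost overlaps with training instead of blocking the main loop."""

    def __init__(self, row_group_iter, capacity=10000, out_queue_size=1024):
        self._row_group_iter = row_group_iter      # yields lists of rows (row groups)
        self._capacity = capacity
        self._out = queue.Queue(maxsize=out_queue_size)
        threading.Thread(target=self._fill, daemon=True).start()

    def _fill(self):
        buffer = []
        for row_group in self._row_group_iter:
            buffer.extend(row_group)
            while len(buffer) > self._capacity:    # emit random rows once buffer is full
                idx = random.randrange(len(buffer))
                buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
                self._out.put(buffer.pop())
        random.shuffle(buffer)                     # drain the remainder at end of epoch
        for row in buffer:
            self._out.put(row)
        self._out.put(None)                        # sentinel: no more rows

    def __iter__(self):
        while True:
            row = self._out.get()
            if row is None:
                break
            yield row
```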