chongxiaoc opened this issue 3 years ago
Had a discussion offline; starting to work on moving the bottleneck data-shuffling part into an asynchronous thread:
https://github.com/uber/petastorm/blob/cf1159dc04416ed737eec25bcecef5d5aafa805a/petastorm/pytorch.py#L373
and seeing how it can hide/overlap the latency.
The existing dataloader implementation is sequential; in the common use case, it limits training speed when the data-shuffling part is slow. This happens when the user configures a fairly large shuffling queue capacity: whenever the dataloader reads the next row group, the shuffling queue buffer has to compress and reshuffle.

The idea of an asynchronous dataloader is to move Parquet file reading, shuffling, and batch production into an asynchronous thread, with a batch queue shared between the consumer (training thread) and the producer (dataloader thread). While the main thread is in the training and backpropagation stages, the asynchronous thread can keep generating batches, overlapping the cost.
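A minimal sketch of this producer/consumer pattern, using only Python's standard `threading` and `queue` modules. Here `batch_iterable` stands in for the existing synchronous read/shuffle/batch path; all names are illustrative, not petastorm API:

```python
import queue
import threading

_SENTINEL = object()  # marks end of data from the producer


def async_batches(batch_iterable, queue_size=4):
    """Produce batches in a background thread; yield them to the trainer."""
    batch_queue = queue.Queue(maxsize=queue_size)

    def _producer():
        try:
            for batch in batch_iterable:   # reading + shuffling happen here
                batch_queue.put(batch)     # blocks if the trainer falls behind
        finally:
            batch_queue.put(_SENTINEL)     # always unblock the consumer

    threading.Thread(target=_producer, daemon=True).start()

    while True:
        batch = batch_queue.get()
        if batch is _SENTINEL:
            return
        yield batch  # trainer consumes while the producer keeps working
```

While the training thread runs the forward/backward pass on one batch, the producer is already filling the queue with the next ones, so the shuffle cost stays hidden as long as the queue does not run dry.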
We would like to introduce an asynchronous dataloader based on a prototype of this work. Looking at the current code base, there are two ways to do it:
1. Add an `is_async` option to the `LoaderBase` class, plus the other variables needed for asynchronous operation. https://github.com/uber/petastorm/blob/cf1159dc04416ed737eec25bcecef5d5aafa805a/petastorm/pytorch.py#L132 This would bring all async-related logic into the existing dataloader, for both the non-batched and batched versions, which would make the dataloader implementation complicated.
2. Create a new async `LoaderBase`, and implement `AsyncDataLoader` and `AsyncBatchedDataLoader` classes. In this case the user has to explicitly select a sync or async dataloader. (A rough sketch of this option follows below.)
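A minimal sketch of option 2, assuming the async machinery lives in a new base class that wraps an existing synchronous loader. The class names follow the proposal above, but the constructor and internals are purely illustrative:

```python
import queue
import threading


class AsyncLoaderBase:
    """Sketch: holds the async machinery so the sync loaders stay untouched."""

    _DONE = object()  # end-of-data marker

    def __init__(self, sync_loader, queue_size=4):
        # sync_loader: an existing DataLoader/BatchedDataLoader instance
        self._sync_loader = sync_loader
        self._queue = queue.Queue(maxsize=queue_size)

    def _fill_queue(self):
        # Runs in the worker thread: the existing synchronous
        # read/shuffle/batch path executes here.
        try:
            for item in self._sync_loader:
                self._queue.put(item)
        finally:
            self._queue.put(self._DONE)

    def __iter__(self):
        threading.Thread(target=self._fill_queue, daemon=True).start()
        while True:
            item = self._queue.get()
            if item is self._DONE:
                return
            yield item


class AsyncDataLoader(AsyncLoaderBase):
    """Async counterpart of DataLoader (sketch)."""


class AsyncBatchedDataLoader(AsyncLoaderBase):
    """Async counterpart of BatchedDataLoader (sketch)."""
```

Under this design the user opts in explicitly, e.g. `AsyncBatchedDataLoader(BatchedDataLoader(...))` in this sketch, rather than flipping an `is_async` flag on the existing classes.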
@selitvin What do you think about the ideas above? And what would you suggest for eventually merging this work into petastorm?
FYI @tgaddair