uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Implementing Asynchronous Data Shuffling Part #638

Open chongxiaoc opened 3 years ago

chongxiaoc commented 3 years ago

The existing dataloader implementation is sequential; in the common use case, this limits training speed whenever the data shuffling step is slow. This happens when the user configures a fairly large shuffling queue capacity: every time the dataloader reads the next row group, the shuffling queue buffer has to compress and reshuffle its contents.

The idea behind an asynchronous dataloader is to move Parquet file reading, shuffling, and batch production into an asynchronous thread, with a batch queue shared between the consumer (training thread) and the producer (dataloader thread). This way, while the main thread runs the training and backpropagation stages, the asynchronous thread keeps generating batches, overlapping the two costs.
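As a minimal illustration of this producer/consumer split (not petastorm code; `produce_batches` and `train_step` are hypothetical placeholders for the real read/shuffle/batch pipeline and the training step):

```python
import queue
import threading

def produce_batches(batch_queue, num_batches=10):
    """Producer (dataloader thread): keeps generating batches into the queue."""
    for i in range(num_batches):
        batch = {"batch_id": i}       # placeholder for a real tensor batch
        batch_queue.put(batch)        # blocks when the queue is full (backpressure)
    batch_queue.put(None)             # sentinel: no more batches

def train_step(batch):
    """Consumer (training thread): forward/backward pass placeholder."""
    pass

def train():
    batch_queue = queue.Queue(maxsize=4)   # bounded queue between the two threads
    producer = threading.Thread(target=produce_batches, args=(batch_queue,), daemon=True)
    producer.start()
    while True:
        batch = batch_queue.get()          # overlaps with the producer refilling the queue
        if batch is None:
            break
        train_step(batch)
    producer.join()

if __name__ == "__main__":
    train()
```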

We have a prototype of an asynchronous dataloader based on existing work. Looking at the current code base, there are two ways to introduce it:

  1. Add an `is_async` option to the `LoaderBase` class, along with the other variables needed for asynchronous operation. https://github.com/uber/petastorm/blob/cf1159dc04416ed737eec25bcecef5d5aafa805a/petastorm/pytorch.py#L132 This would put all the async-related logic into the existing dataloader, for both the non-batched and batched versions, which would make the dataloader implementation complicated.

  2. Create a new `AsyncLoaderBase`, and implement `AsyncDataLoader` and `AsyncBatchedDataLoader` classes on top of it. In this case the user has to explicitly select a sync or async dataloader (a rough sketch follows below).
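For illustration, here is what option 2 could look like. The class name `AsyncLoaderBase` comes from the proposal above, but the implementation is purely hypothetical (a thin wrapper that drains an existing synchronous loader from a background thread), not petastorm's actual API:

```python
import queue
import threading

class AsyncLoaderBase:
    """Hypothetical sketch of option 2: wrap a synchronous loader and drain it
    from a background thread into a bounded batch queue."""

    def __init__(self, sync_loader, queue_size=4):
        self._sync_loader = sync_loader            # e.g. an existing (Batched)DataLoader
        self._queue = queue.Queue(maxsize=queue_size)
        self._thread = None

    def _producer(self):
        try:
            for batch in self._sync_loader:        # read + shuffle + batch happen here
                self._queue.put(batch)
        finally:
            self._queue.put(None)                  # sentinel marks end of epoch

    def __iter__(self):
        self._thread = threading.Thread(target=self._producer, daemon=True)
        self._thread.start()
        while True:
            batch = self._queue.get()
            if batch is None:
                break
            yield batch
        self._thread.join()
```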

@selitvin What do you think about the above ideas? And what are your suggestions for merging this work into petastorm going forward?

FYI @tgaddair

chongxiaoc commented 3 years ago

Had a discussion offline; starting with moving the bottleneck, the data shuffling part, into an asynchronous thread. https://github.com/uber/petastorm/blob/cf1159dc04416ed737eec25bcecef5d5aafa805a/petastorm/pytorch.py#L373

Then we will see how much of the latency it can hide/overlap.
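As a rough sketch of that direction (purely hypothetical, not the actual pytorch.py code): a shuffling buffer filled from a background thread, so the reshuffle cost overlaps with training:

```python
import queue
import random
import threading

class AsyncShufflingBuffer:
    """Hypothetical sketch: fill a shuffling buffer from a background thread so
    the reshuffle cost overlaps with training instead of blocking the main loop."""

    def __init__(self, row_group_iter, capacity=10000, out_queue_size=1024):
        self._row_group_iter = row_group_iter      # yields lists of rows (row groups)
        self._capacity = capacity
        self._out = queue.Queue(maxsize=out_queue_size)
        threading.Thread(target=self._fill, daemon=True).start()

    def _fill(self):
        buffer = []
        for row_group in self._row_group_iter:
            buffer.extend(row_group)
            while len(buffer) > self._capacity:    # emit random rows once buffer is full
                idx = random.randrange(len(buffer))
                buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
                self._out.put(buffer.pop())
        random.shuffle(buffer)                     # drain the remainder at end of epoch
        for row in buffer:
            self._out.put(row)
        self._out.put(None)                        # sentinel: no more rows

    def __iter__(self):
        while True:
            row = self._out.get()
            if row is None:
                break
            yield row
```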