uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.

Apache License 2.0

1.8k stars 284 forks source link

Pytorch: add AsyncBatchedDataloader #641

Closed chongxiaoc closed 2 years ago

chongxiaoc commented 3 years ago

Add an asynchronous thread to shuffle batches in background. Main thread consumes ready shuffled batches in the queue. Related to #638

codecov[bot] commented 3 years ago

Codecov Report

Merging #641 (45a7a80) into master (e89a7fe) will increase coverage by 0.14%. The diff coverage is 95.71%.

@@            Coverage Diff             @@
##           master     #641      +/-   ##
==========================================
+ Coverage   85.23%   85.38%   +0.14%     
==========================================
  Files          85       85              
  Lines        4985     5055      +70     
  Branches      792      806      +14     
==========================================
+ Hits         4249     4316      +67     
- Misses        596      597       +1     
- Partials      140      142       +2

Impacted Files	Coverage Δ
petastorm/pytorch.py	`93.20% <95.71%> (+0.97%)`	:arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e89a7fe...45a7a80. Read the comment docs.

chongxiaoc commented 3 years ago

@tgaddair I fixed above comments, most of changes are related to make async shuffle thread work for reusing dataloader for the same reader multiple times. In that case, we have to launch a new async thread for every new pass of files (async thread has exited for previous pass when reach end of data).

Please take a look, thanks.

chongxiaoc commented 3 years ago

@selitvin Any comments to debug the dockercloud-ci failure? Link is not working to me, showing 404.

chongxiaoc commented 3 years ago

Do you think we can factor out new functionality into a separate class (separation of concerns: the queue thread / rest of the data loader) to keep class size more manageable?

Good point, I will think about it.

Test cases become hairy with the parallelism introduced to the DataLoader implementation with some potential for race conditions. Would like to make sure we cover cases like cases like :

We exit DataLoader context while shuffling queue is full

We exit DataLoader without fetching any samples

An error is raised somewhere within worker thread

possibly more cases...

I will try to add some tests hitting above cases.

chongxiaoc commented 3 years ago

@selitvin I've refactored the async dataloader into a new class. Please take a look when you have time and let me know what you think about. Thanks.

chongxiaoc commented 3 years ago

convert to draft for now.

chongxiaoc commented 2 years ago

close this PR since async process is added into Horovod