openai / Video-Pre-Training

Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

9 days on 720 GPUs? #24

jens321 opened this issue 1 year ago

jens321 commented 1 year ago

In section 4.2 (on the VPT Foundation Model Training), the paper states that

Preliminary experiments suggested that our model could benefit from 30 epochs of training and that a 0.5 billion parameter model was required to stay in the efficient learning regime [63] for that training duration (Appendix H), which took ∼9 days on 720 V100 GPUs.

Could you give some insight into why this many GPUs were required? Was it for data parallelism, model parallelism, or some other reason?
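
For context, a quick back-of-envelope on the quoted figures (my own arithmetic, not from the paper):

```python
# Rough compute budget implied by "~9 days on 720 V100 GPUs" (my own estimate).
gpus = 720
days = 9
gpu_days = gpus * days        # 6,480 V100-days
gpu_hours = gpu_days * 24     # 155,520 V100-hours
print(gpu_days, gpu_hours)
```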

Thank you.

Miffyli commented 1 year ago

Hey! You could try poking the authors directly with an email. I am not one of the authors, but my understanding is that they used that many GPUs purely for data parallelism; even the largest VPT model fits on a single 32GB V100. More GPUs shorten the wall-clock training time, so I'd guess they simply used as many as they had available :D
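
To illustrate what I mean by pure data parallelism, here is a minimal sketch assuming PyTorch DistributedDataParallel. This is not the VPT training code (the model here is a hypothetical stand-in), just the general pattern: every GPU holds a full copy of the model and only the data batch is split, with gradients averaged across GPUs each step.

```python
# Minimal data-parallel training sketch (assumes PyTorch DDP, launched via torchrun).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Hypothetical stand-in for the policy network; the real ~0.5B-parameter
    # VPT model is defined in this repo and fits on one 32GB V100.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).cuda(local_rank)

    # DDP replicates the model on every GPU and all-reduces gradients,
    # so the effective batch size scales with the number of GPUs.
    model = DDP(model, device_ids=[local_rank])
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)

    for step in range(10):
        # Each rank would load a *different* shard of the video dataset here.
        x = torch.randn(16, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        optim.zero_grad()
        loss.backward()   # gradients averaged across all ranks
        optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nnodes=<N> --nproc_per_node=8 train_ddp.py`, throughput scales with the number of GPUs while the per-GPU memory footprint stays constant, which is why a model that fits on a single V100 can still make good use of 720 of them.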