openai / Video-Pre-Training

Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

9 days on 720 GPUs? #24

jens321 opened this issue 1 year ago

jens321 commented 1 year ago

In section 4.2 (on the VPT Foundation Model Training), the paper states that

Preliminary experiments suggested that our model could benefit from 30 epochs of training and that a 0.5 billion parameter model was required to stay in the efficient learning regime [63] for that training duration (Appendix H), which took ∼9 days on 720 V100 GPUs.

Could you give some insight into why this many GPUs were required? Was it for data parallelism, model parallelism, or some other reason?
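
For context, a quick back-of-envelope on the quoted figures (my own arithmetic, not from the paper):

```python
# Rough compute budget implied by "~9 days on 720 V100 GPUs" (my own estimate).
gpus = 720
days = 9
gpu_days = gpus * days        # 6,480 V100-days
gpu_hours = gpu_days * 24     # 155,520 V100-hours
print(gpu_days, gpu_hours)
```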

Thank you.

Miffyli commented 1 year ago

Hey! You could try poking the authors directly with an email. I am not one of the authors, but my understanding is that they used that many GPUs purely for data parallelism; even the largest VPT model fits on a single 32GB V100. More GPUs shorten the wall-clock training time, so I'd guess they simply used as many as they had available :D
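
To illustrate what I mean by pure data parallelism, here is a minimal sketch assuming PyTorch DistributedDataParallel. This is not the VPT training code (the model here is a hypothetical stand-in), just the general pattern: every GPU holds a full copy of the model and only the data batch is split, with gradients averaged across GPUs each step.

```python
# Minimal data-parallel training sketch (assumes PyTorch DDP, launched via torchrun).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Hypothetical stand-in for the policy network; the real ~0.5B-parameter
    # VPT model is defined in this repo and fits on one 32GB V100.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).cuda(local_rank)

    # DDP replicates the model on every GPU and all-reduces gradients,
    # so the effective batch size scales with the number of GPUs.
    model = DDP(model, device_ids=[local_rank])
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)

    for step in range(10):
        # Each rank would load a *different* shard of the video dataset here.
        x = torch.randn(16, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        optim.zero_grad()
        loss.backward()   # gradients averaged across all ranks
        optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nnodes=<N> --nproc_per_node=8 train_ddp.py`, throughput scales with the number of GPUs while the per-GPU memory footprint stays constant, which is why a model that fits on a single V100 can still make good use of 720 of them.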