Open jens321 opened 1 year ago
In section 4.2 (on the VPT Foundation Model Training), the paper notes the large number of GPUs used to train the VPT foundation model.

Could you give some insight into what required using this many GPUs? Did it have to do with data parallelism, model parallelism, or something else?

Thank you.

---

Hey! You could try reaching out to the authors directly by email. I'm not one of the authors, but my understanding is that they used that many GPUs purely for data parallelism; even the largest VPT model fits on a single 32 GB V100. With more GPUs they could shorten the training wall-clock time, so I'd guess they simply used as many as they had available :D
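For anyone curious what that kind of data-parallel setup roughly looks like, here is a minimal PyTorch DistributedDataParallel sketch. This is not the authors' actual training code; the model, dataset, and hyperparameters are placeholders. The point is that each GPU holds a full copy of the model and trains on its own shard of the data, so adding GPUs mainly increases throughput and cuts wall-clock time rather than enabling a bigger model.

```python
# Minimal data-parallel training sketch with PyTorch DistributedDataParallel.
# Placeholder model/data; launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process (one per GPU).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank holds a *full* copy of the model -- pure data parallelism,
    # which works as long as the model fits on a single GPU (e.g. a 32 GB V100).
    model = torch.nn.Sequential(
        torch.nn.Linear(128, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
    ).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Dummy dataset; the DistributedSampler gives each rank a disjoint shard,
    # so total throughput scales with the number of GPUs.
    data = TensorDataset(torch.randn(4096, 128), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()   # DDP all-reduces gradients across ranks here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Model parallelism (splitting one model across GPUs) would only be needed if a single copy didn't fit in GPU memory, which, as noted above, isn't the case for VPT on a 32 GB V100.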