mlfoundations / open_clip

An open source implementation of CLIP.

Training speed slow #861

Closed · lezhang7 closed 2 months ago

lezhang7 commented 2 months ago

Hi,

I found that training slows down when the number of GPUs is more than 2. Is it because more GPUs bring a larger batch size to compute, so gather_all takes up more time?

Best
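
For reference on the overhead the question asks about, here is a minimal sketch (not from this thread) that times the cross-GPU feature gather in isolation. It assumes a single-node launch via `torchrun --nproc_per_node=<num_gpus>` with the NCCL backend; the 256 x 512 feature shape is a hypothetical stand-in for a CLIP-style embedding batch.

```python
import time

import torch
import torch.distributed as dist


def main():
    # Assumes launch via: torchrun --nproc_per_node=<num_gpus> this_script.py
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())  # single-node assumption

    # Hypothetical feature batch: 256 samples x 512-dim CLIP-style embeddings.
    features = torch.randn(256, 512, device="cuda")
    gathered = [torch.empty_like(features) for _ in range(world_size)]

    # Warm up NCCL so one-time communicator setup isn't included in the timing.
    dist.all_gather(gathered, features)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(20):
        dist.all_gather(gathered, features)
    torch.cuda.synchronize()
    print(f"rank {rank}: {(time.time() - start) / 20 * 1e3:.3f} ms per all_gather")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```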

rwightman commented 2 months ago

@lezhang7 you see fewer samples per second per GPU as you increase the number of GPUs, but the total samples/sec should increase until you saturate your interconnect. If going beyond two GPUs causes a significant slowdown, you could have a broken distributed setup, or slow disks causing IO bottlenecks when reading your dataset...
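
One way to test the scaling described above is to measure per-GPU and total samples/sec directly. Below is a minimal sketch under the same single-node `torchrun` + NCCL assumption; the synthetic `(x * 2).sum()` step and the batch size of 256 are placeholders standing in for a real train step, not open_clip code.

```python
import time

import torch
import torch.distributed as dist


def main():
    # Assumes launch via: torchrun --nproc_per_node=<num_gpus> this_script.py
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    device = torch.device("cuda", rank % torch.cuda.device_count())  # single node
    torch.cuda.set_device(device)

    batch_size, steps = 256, 50
    x = torch.randn(batch_size, 3, 224, 224, device=device)

    # Warm-up iterations so CUDA context setup doesn't skew the timing.
    for _ in range(5):
        _ = (x * 2).sum()
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(steps):
        y = (x * 2).sum().reshape(1)  # placeholder for a real forward/backward
        dist.all_reduce(y)            # stands in for gradient sync traffic
    torch.cuda.synchronize()
    elapsed = time.time() - start

    # Per-GPU rate, then sum across ranks to get total samples/sec.
    rate = torch.tensor([batch_size * steps / elapsed], device=device)
    total = rate.clone()
    dist.all_reduce(total)
    print(f"rank {rank}: {rate.item():.0f} samples/s per GPU, "
          f"{total.item():.0f} samples/s total")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If the total samples/sec stops growing (or drops) past two GPUs while the per-GPU rate collapses, that points at the interconnect or a misconfigured launcher rather than the model itself.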