lucidrains / gigagan-pytorch

Implementation of GigaGAN, the new SOTA GAN out of Adobe; the culmination of nearly a decade of research into GANs.
MIT License

Multi GPU with gradient accumulation #37

Open dprze opened 1 year ago

dprze commented 1 year ago

Hi! When training on multiple GPUs with gradient accumulation steps > 1, there is no substantial speedup relative to a single GPU (there is a speedup when the value is 1). I found the following threads on Hugging Face, here and here, that seem to provide a solution. I even ran a quick test by passing the appropriate argument to Accelerator, and training was indeed much faster (in your class I set the gradient accumulation steps to 1 but set it to 8 for the Accelerator; I didn't make any other changes to account for this modification, so the results weren't particularly useful 😉). If you have time to check whether this is of interest to you, I'd be grateful.
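For reference, this is roughly the pattern those threads suggest: let the `Accelerator` manage accumulation itself via its `gradient_accumulation_steps` argument and the `accumulate` context manager. A minimal sketch only; the model, optimizer, and dataloader below are placeholders, not this repo's actual trainer objects.

```python
import torch
import torch.nn.functional as F
from accelerate import Accelerator

# Let Accelerate manage gradient accumulation instead of looping manually.
accelerator = Accelerator(gradient_accumulation_steps=8)

# Placeholder objects, standing in for the repo's actual trainer state.
model = torch.nn.Linear(128, 1)
optimizer = torch.optim.Adam(model.parameters())
dataset = torch.utils.data.TensorDataset(torch.randn(64, 128), torch.randn(64, 1))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    # Under `accumulate`, Accelerate wraps the non-final micro-batches in
    # DDP's no_sync, deferring the inter-GPU gradient all-reduce until the
    # last micro-batch; that skipped communication is where the multi-GPU
    # speedup comes from. The prepared optimizer also skips step()/zero_grad()
    # on non-sync iterations, so they can be called every iteration.
    with accelerator.accumulate(model):
        loss = F.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```

With manual accumulation (backward on every micro-batch without `no_sync`), DDP all-reduces the gradients on every micro-batch, which is consistent with the lack of speedup observed here.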

julien-blanchon commented 2 months ago

I'm also experiencing this behaviour with HF Accelerate in my custom code.