Hi @jkang1640,
The distributed optimizer in cherry is an alternative to DDP, so you don't need to wrap your model to distribute it. If you did, I believe it would simply average the gradients twice, which would lead to unnecessary communication overhead.
The example you linked uses cherry and not DDP because I'm not sure how DDP's backward reduction would work with second-order gradients -- I just haven't tested it thoroughly.
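For reference, the setup in the linked example boils down to something like the sketch below: you wrap the optimizer, not the model. This assumes `torch.distributed` is already initialized (e.g. via `torchrun`); the linear module and the hyperparameters are just placeholders for a real model and configuration.

```python
import torch
import cherry
import learn2learn as l2l

# Placeholder for the real model -- note it is NOT wrapped in DDP.
model = torch.nn.Linear(32, 32)
maml = l2l.algorithms.MAML(model, lr=0.01)  # second-order by default

# Distribute by wrapping the optimizer instead of the model.
opt = torch.optim.Adam(maml.parameters(), lr=0.001)
opt = cherry.optim.Distributed(maml.parameters(), opt, sync=1)
opt.sync_parameters()  # start all workers from identical weights
```

With this setup, each worker computes its own meta-gradients, and the wrapper averages them across workers when `step()` is called.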
Hello,
Thanks to the example code, I could implement MAML with DDP for a seq2seq model.

While implementing it, a question came up about the timing of gradient reduction. When we use a DDP wrapper around a model, every `backward()` call implicitly reduces gradients across GPUs, if I understood correctly. In the example code, I guess that cherry's

`opt.step()  # averages gradients across all workers`

does the job, as the comment says. Does that mean gradients are not reduced at any moment before then, not even at `backward()`?
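To make sure I understand, here is a conceptual sketch of what I think that comment means (not cherry's actual implementation; it assumes `torch.distributed` is already initialized and that the module is a plain, non-DDP one): `backward()` only accumulates local gradients, and the cross-worker average happens explicitly just before the underlying optimizer applies the update.

```python
import torch
import torch.distributed as dist

def averaged_step(module, optimizer, local_losses):
    # Accumulate gradients from this worker's losses only;
    # no cross-GPU communication happens during these backward() calls.
    optimizer.zero_grad()
    for loss in local_losses:
        loss.backward()

    # The cross-worker average happens here, right before the update.
    world_size = dist.get_world_size()
    for p in module.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)

    optimizer.step()
```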
Thank you very much for your help!