sachit-menon closed this issue 4 years ago
Thanks for the report @sachit-menon!
It looks like this lr change was specific to an earlier version of pytorch-lightning we used for the original MoCo runs (0.7.1). Because this has been fixed in more recent versions, the lr should be the same as recommended (lr=0.03). I've updated the documentation. Thanks again!
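For anyone landing here later, a rough sketch of the arithmetic (assuming MoCo's published defaults of lr=0.03 at an effective batch size of 256; the GPU count below is just a hypothetical example):

```python
# Sketch of the effective-batch-size arithmetic under ddp.
# Assumes MoCo's defaults (lr=0.03 for an effective batch of 256);
# n_gpus here is only an illustrative value.
n_gpus = 8
per_gpu_batch_size = 32                               # what each ddp process sees
effective_batch_size = per_gpu_batch_size * n_gpus    # 256 across all processes

# Linear scaling rule relative to MoCo's reference batch of 256:
base_lr = 0.03
lr = base_lr * effective_batch_size / 256             # stays at 0.03, no extra n_gpus factor

print(effective_batch_size, lr)  # 256 0.03
```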
Hi, thanks for this great exploration of BYOL! I have a (perhaps mundane) question about the implementation here; you note in the README that the `batch_size` and `lr` are adjusted when training with ddp across multiple GPUs.
I understand that in the official MoCo code, they manually scale `batch_size` to be `batch_size/n_gpus` when using ddp (https://github.com/facebookresearch/moco/blob/master/main_moco.py#L174 for reference). So `batch_size=32` makes sense to me, as Lightning's ddp wraps `nn.parallel`. However, I don't really understand the change in `lr` - could you explain why you scale to `lr*n_gpus`? The MoCo example doesn't seem to do this scaling, so I'm wondering what about Lightning results in needing the change. Any input would be really appreciated!
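For concreteness, here is roughly what I mean by the two conventions (a sketch, not copied verbatim from either codebase; the function and variable names are illustrative):

```python
# MoCo-style: a global batch size is passed on the command line and the script
# divides it by the number of GPUs so each process gets its share
# (roughly what main_moco.py does around the linked line).
def moco_style_per_gpu_batch(global_batch_size: int, n_gpus: int) -> int:
    return int(global_batch_size / n_gpus)

# Lightning-style ddp: the batch_size passed to the DataLoader is already
# per-process, so no manual division is needed.
def lightning_style_per_gpu_batch(per_gpu_batch_size: int) -> int:
    return per_gpu_batch_size

# Both end up with 32 samples per GPU for a 256 total batch on 8 GPUs.
assert moco_style_per_gpu_batch(256, 8) == lightning_style_per_gpu_batch(32) == 32
```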