lucidrains / lion-pytorch

🦁 Lion, a new optimizer discovered by Google Brain using genetic algorithms that is purportedly better than Adam(w), in PyTorch
MIT License

Learning rate scaling for distributed training? #8

Open RahulBhalley opened 1 year ago

RahulBhalley commented 1 year ago

Hi @lucidrains, thanks for this implementation.

I wonder whether you're using distributed training for your experiments. If so, do you scale your learning rate based on the number of processes (GPUs), as noted in Accelerate's docs, on top of the downscaling recommended for the Lion optimizer (even if you're not using Accelerate)?

If you don't currently scale the learning rate, would you still recommend doing so?
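To make the question concrete, here is a minimal sketch of what I mean, assuming Accelerate and the linear scaling rule; the base LR, the 3x Lion reduction, and the placeholder model are illustrative assumptions, not recommendations:

```python
import torch
from accelerate import Accelerator
from lion_pytorch import Lion

accelerator = Accelerator()
model = torch.nn.Linear(512, 512)  # placeholder model

adamw_lr = 3e-4                    # LR one would have used with AdamW (assumed)
base_lion_lr = adamw_lr / 3        # Lion is usually run with a ~3-10x smaller LR than AdamW

# the question: should this linear scaling by number of processes be applied on top?
scaled_lr = base_lion_lr * accelerator.num_processes

optimizer = Lion(model.parameters(), lr=scaled_lr, weight_decay=1e-2)
model, optimizer = accelerator.prepare(model, optimizer)
```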

RahulBhalley commented 1 year ago

Btw, Lion's parameter updates seem to run faster per training step than Adam's or AdamW's.
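A rough way to check this is to time just the optimizer step for both optimizers on the same model; Lion keeps a single momentum buffer per parameter while AdamW keeps two, so its step tends to be cheaper. The toy model, learning rates, and step count below are arbitrary choices for illustration:

```python
import time
import torch
from lion_pytorch import Lion

model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)])
x = torch.randn(64, 1024)

def time_steps(optimizer, steps=20):
    # average wall-clock time of optimizer.step() only, excluding forward/backward
    total = 0.0
    for _ in range(steps):
        optimizer.zero_grad()
        loss = model(x).pow(2).mean()
        loss.backward()
        start = time.perf_counter()
        optimizer.step()
        total += time.perf_counter() - start
    return total / steps

print("Lion :", time_steps(Lion(model.parameters(), lr=1e-4)))
print("AdamW:", time_steps(torch.optim.AdamW(model.parameters(), lr=3e-4)))
```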

simasima121 commented 1 year ago

I have run a few experiments and got unexpected results.

It seems as if Lion doesn't follow the traditional scaling law so far.

With a per-GPU batch size of 64 across multiple GPUs, it doesn't matter how much we scale the LR; training is only 10-20% faster.

For example, with 1 GPU and a batch size of 64, the training loss at iteration 100 is 0.001; running on 4 GPUs (effective batch size 256) with a 4x larger LR, the loss at iteration 100 is about 0.0009.

I have tried keeping the LR the same and making it 2x bigger, 4x bigger, and much larger, but it doesn't help.

With Adam, the usual scaling laws do apply.
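For concreteness, one run of the sweep looks roughly like the sketch below, assuming Accelerate is used for the multi-GPU runs; the model, synthetic data, and base LR are stand-ins rather than the actual experiment:

```python
import torch
from accelerate import Accelerator
from lion_pytorch import Lion

LR_MULTIPLIER = 4          # varied across runs: 1, 2, 4, and larger
BASE_LR = 1e-4             # single-GPU Lion LR (assumed)

accelerator = Accelerator()
model = torch.nn.Linear(512, 10)  # stand-in model
optimizer = Lion(model.parameters(), lr=BASE_LR * LR_MULTIPLIER)

dataset = torch.utils.data.TensorDataset(
    torch.randn(32768, 512), torch.randint(0, 10, (32768,))
)
loader = torch.utils.data.DataLoader(dataset, batch_size=64)  # per-GPU batch size; effective bs = 64 * num GPUs
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

criterion = torch.nn.CrossEntropyLoss()
for step, (inputs, targets) in enumerate(loader):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    accelerator.backward(loss)
    optimizer.step()
    if step == 100:        # compare the loss at iteration 100 across runs
        accelerator.print(f"x{LR_MULTIPLIER} LR, iteration 100 loss: {loss.item():.4f}")
        break
```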

I would appreciate any ideas here.

Thanks

RahulBhalley commented 1 year ago


@simasima121 That's interesting, thanks! I wonder if you'll get any better results with Lion.