Open RahulBhalley opened 1 year ago
Btw, Lion seems to update parameters faster per train step than Adam or AdamW.
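A rough way to sanity-check that per-step cost is a micro-benchmark like the sketch below (illustrative only: the dummy linear model, sizes, and LRs are placeholders; `Lion` is the optimizer class from this repo):

```python
# Rough timing sketch (illustrative only): compare wall-clock time of the optimizer
# update itself for Lion vs. AdamW on a dummy linear model. Sizes and LRs are placeholders.
import time
import torch
from torch import nn
from lion_pytorch import Lion

def time_opt_step(model, opt, steps=50):
    # produce gradients once; only the optimizer update itself is timed
    x = torch.randn(64, 1024)
    model(x).sum().backward()
    opt.step()  # warm-up step so lazy optimizer-state allocation isn't timed
    start = time.perf_counter()
    for _ in range(steps):
        opt.step()
    return (time.perf_counter() - start) / steps

model_lion = nn.Linear(1024, 1024)
model_adamw = nn.Linear(1024, 1024)

opt_lion = Lion(model_lion.parameters(), lr=1e-4, weight_decay=1e-2)
opt_adamw = torch.optim.AdamW(model_adamw.parameters(), lr=3e-4, weight_decay=1e-2)

print("Lion  seconds/step:", time_opt_step(model_lion, opt_lion))
print("AdamW seconds/step:", time_opt_step(model_adamw, opt_adamw))
```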
I have run a few experiments and got unexpected results: so far, Lion doesn't seem to follow the usual learning-rate scaling rule.
With a per-GPU batch size of 64 across multiple GPUs, no matter how much we scale the LR, training converges only 10-20% faster.
For example, with 1 GPU and a batch size of 64, the training loss at iteration 100 is 0.001; on 4 GPUs (effective batch size of 256) with a 4x larger LR, the loss at iteration 100 is about 0.0009.
I have tried keeping the LR the same and making it 2x larger, 4x larger, and much larger, but none of it helps.
With Adam, the usual scaling rule does apply.
I would appreciate any ideas here.
Thanks
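In case the exact setup matters, here is a minimal sketch of the scaling rule being tried above; the base LR, batch sizes, and the `scaled_lion` helper are illustrative, not part of this repo:

```python
# Sketch of the linear LR-scaling rule described above; names and values are illustrative.
from lion_pytorch import Lion

base_bs = 64    # per-GPU batch size the base LR was tuned on
base_lr = 1e-4  # Lion LRs are typically several times smaller than AdamW's

def scaled_lion(params, effective_bs, lr_multiplier=None):
    # default to linear scaling: 4 GPUs x 64 -> effective_bs 256 -> 4x the base LR
    if lr_multiplier is None:
        lr_multiplier = effective_bs / base_bs
    return Lion(params, lr=base_lr * lr_multiplier, weight_decay=1e-2)

# e.g. the runs described above:
#   1 GPU:  scaled_lion(model.parameters(), effective_bs=64)                      -> lr = 1e-4
#   4 GPUs: scaled_lion(model.parameters(), effective_bs=256)                     -> lr = 4e-4
#   4 GPUs: scaled_lion(model.parameters(), effective_bs=256, lr_multiplier=1.0)  -> lr = 1e-4
```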
@simasima121 That's interesting. Thanks! I wonder if you get any better results with LION.
Hi @lucidrains, thanks for this implementation.
I wonder if you're using distributed training for your experiments. If so, do you scale your learning rate based on the number of processes (GPUs), as noted in Accelerate's docs, on top of the downscaling recommended for the Lion optimizer (even if you're not using Accelerate)?
If you don't scale learning rate, do you recommend doing so?
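For concreteness, this is the pattern I mean; it's only a sketch, assuming Hugging Face Accelerate and the `Lion` class from this repo, with a placeholder base LR, toy model, and dataset:

```python
# Sketch: scale the base LR by the number of processes (GPUs) before calling prepare().
# The base LR, toy model, and dataset below are placeholders for illustration.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from lion_pytorch import Lion

accelerator = Accelerator()

model = nn.Linear(32, 1)
dataloader = DataLoader(
    TensorDataset(torch.randn(256, 32), torch.randn(256, 1)),
    batch_size=64,
)

base_lr = 1e-4                            # Lion LRs are typically 3-10x smaller than AdamW's
lr = base_lr * accelerator.num_processes  # linear scaling with the number of GPUs

optimizer = Lion(model.parameters(), lr=lr, weight_decay=1e-2)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```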