lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch
MIT License

Learning rate scaling #143


hmartiro commented 1 year ago

I see the default learning rate of SoundStreamTrainer is 2e-4. I have a few questions:

  1. Should the LR be doubled if the batch size is doubled?
  2. Should the LR be doubled if the number of GPUs is doubled, e.g. when training multi-GPU with accelerate? Or is this effectively scaled inside train_step()?
  3. Should the LR be doubled if the gradient accumulation steps are doubled? I notice this implementation handles accumulation manually rather than using accelerate's built-in gradient accumulation.
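
For concreteness, the quantity that LR scaling rules usually refer to is the effective batch size, i.e. the product of all three of these knobs. A minimal sketch of that arithmetic (the names below are just illustrative, not SoundStreamTrainer arguments):

```python
# illustrative only -- these are not trainer arguments
per_device_batch_size = 4   # samples per forward pass on one GPU
num_processes = 2           # GPUs launched via accelerate
grad_accum_steps = 8        # micro-batches per optimizer step

# number of samples that contribute to each optimizer step
effective_batch_size = per_device_batch_size * num_processes * grad_accum_steps
print(effective_batch_size)  # 64
```

Doubling any one of the three doubles the effective batch size, so the question is really whether (and how) the LR should track that product.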
lucidrains commented 1 year ago

@hmartiro oh hey Hayk! yeah, you know, even after all this time, I still don't know the answer to this. maybe an optimizer expert can stand up and say something more definitive and put this to rest

i think the conventional rule of thumb has always been that the LR should increase as the batch size increases (and batch size scales linearly with the number of devices). however, i don't know what the exact relationship should be, and clearly there are papers that ignore this (for example, the recent Llama paper still used a learning rate of 3e-4 even with a batch size of 4 million tokens...)
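
if it helps to put numbers on it: the classic linear scaling rule (popularized for SGD by Goyal et al.) and the square-root variant some people prefer for Adam-style optimizers would look like the sketch below. this is purely a sketch of the heuristics; the reference batch size is made up, not anything the trainer assumes, and the 2e-4 default wasn't derived from either formula

```python
# heuristic LR scaling -- a starting point for a sweep, not a guarantee
base_lr = 2e-4
base_effective_batch = 32   # assumed reference batch size, purely illustrative
new_effective_batch = 64    # e.g. after doubling GPUs or grad accum steps

ratio = new_effective_batch / base_effective_batch
linear_lr = base_lr * ratio        # 4e-4, linear scaling rule (SGD-style)
sqrt_lr = base_lr * ratio ** 0.5   # ~2.8e-4, square-root rule (often used with Adam)
```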

for gradient accumulation, huggingface was building that feature just as i started using accelerate, and when i last tried it, it had a few rough edges. i'll give it another try with a new GAN project, and if it works well, i'll redo the code here to use it. just being cautious
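
for reference, the accelerate-native version would be `Accelerator(gradient_accumulation_steps=...)` together with the `accelerator.accumulate(...)` context manager, roughly like the sketch below. the toy model and dataset are only there to keep it self-contained; this is not the repo's actual training loop

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# toy model / data so the sketch runs on its own
model = nn.Linear(16, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
dataloader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=4)

# accelerate handles device placement, DDP wrapping, and gradient accumulation
accelerator = Accelerator(gradient_accumulation_steps=8)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    # inside this context, gradient sync and the optimizer step are skipped
    # until 8 micro-batches have been accumulated
    with accelerator.accumulate(model):
        loss = nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```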