lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch
MIT License

Learning rate scaling #143


hmartiro commented 1 year ago

I see the default learning rate of SoundStreamTrainer is 2e-4. I have a few questions:

  1. Should the LR be doubled if the batch size is doubled?
  2. Should the LR be doubled if the number of GPUs is doubled, e.g. when training multi-GPU with accelerate? Or is this effectively scaled inside train_step()?
  3. Should the LR be doubled if the gradient accumulation steps are doubled? I notice this implementation handles accumulation manually rather than using accelerate's built-in gradient accumulation.
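
For concreteness, the quantity that LR scaling rules usually refer to is the effective batch size, i.e. the product of all three of these knobs. A minimal sketch of that arithmetic (the names below are just illustrative, not SoundStreamTrainer arguments):

```python
# illustrative only -- these are not trainer arguments
per_device_batch_size = 4   # samples per forward pass on one GPU
num_processes = 2           # GPUs launched via accelerate
grad_accum_steps = 8        # micro-batches per optimizer step

# number of samples that contribute to each optimizer step
effective_batch_size = per_device_batch_size * num_processes * grad_accum_steps
print(effective_batch_size)  # 64
```

Doubling any one of the three doubles the effective batch size, so the question is really whether (and how) the LR should track that product.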
lucidrains commented 1 year ago

@hmartiro oh hey Hayk! yeah, you know, even after all this time, I still don't know the answer to this. maybe an optimizer expert can stand up and say something more definitive and put this to rest

i think the conventional rule of thumb has always been that the LR should increase as the batch size increases (and batch size scales linearly with the number of devices). however, i don't know what the exact relationship should be, and clearly there are papers that ignore this (for example, the recent Llama paper still used a learning rate of 3e-4 even with a batch size of 4 million tokens...)
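
if it helps to put numbers on it: the classic linear scaling rule (popularized for SGD by Goyal et al.) and the square-root variant some people prefer for Adam-style optimizers would look like the sketch below. this is purely a sketch of the heuristics; the reference batch size is made up, not anything the trainer assumes, and the 2e-4 default wasn't derived from either formula

```python
# heuristic LR scaling -- a starting point for a sweep, not a guarantee
base_lr = 2e-4
base_effective_batch = 32   # assumed reference batch size, purely illustrative
new_effective_batch = 64    # e.g. after doubling GPUs or grad accum steps

ratio = new_effective_batch / base_effective_batch
linear_lr = base_lr * ratio        # 4e-4, linear scaling rule (SGD-style)
sqrt_lr = base_lr * ratio ** 0.5   # ~2.8e-4, square-root rule (often used with Adam)
```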

for gradient accumulation, huggingface was building that feature just as i started using accelerate, and when i last tried it, it had a few rough edges. i'll give it another try with a new GAN project, and if it works well, i'll redo the code here to use it. just being cautious
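
for reference, the accelerate-native version would be `Accelerator(gradient_accumulation_steps=...)` together with the `accelerator.accumulate(...)` context manager, roughly like the sketch below. the toy model and dataset are only there to keep it self-contained; this is not the repo's actual training loop

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# toy model / data so the sketch runs on its own
model = nn.Linear(16, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
dataloader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=4)

# accelerate handles device placement, DDP wrapping, and gradient accumulation
accelerator = Accelerator(gradient_accumulation_steps=8)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    # inside this context, gradient sync and the optimizer step are skipped
    # until 8 micro-batches have been accumulated
    with accelerator.accumulate(model):
        loss = nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```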