hmartiro opened this issue 1 year ago

I see the default learning rate of `SoundStreamTrainer` is 2e-4. I have a few questions: should the learning rate be scaled up when training on multiple devices, given that the effective batch size scales with the device count? And is there a way to do gradient accumulation with `train_step()`?
@hmartiro oh hey Hayk! yeah, you know, even after all this time, I still don't know the answer to this. maybe an optimizer expert can stand up and say something more definitive, put this to rest

i think the conventional rule of thumb has always been that the LR should increase as the batch size increases (which scales linearly with the number of devices). however, i don't know what the exact relationship should be, and clearly some papers ignore this (for example, the recent Llama paper still used a learning rate of 3e-4 even with a batch size of 4 million tokens...)
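for reference, a minimal sketch of the two heuristics people usually cite: the linear scaling rule from Goyal et al., 2017 ("Accurate, Large Minibatch SGD") and a square-root variant sometimes suggested for Adam-family optimizers. the base values below are made up for illustration, not defaults from this repo:

```python
# two common heuristics for scaling LR with effective batch size
# (illustrative only -- base values here are made up, not repo defaults)

def linear_scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule (Goyal et al., 2017): LR grows 1:1 with batch size."""
    return base_lr * batch / base_batch

def sqrt_scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Square-root scaling, sometimes preferred for Adam-family optimizers."""
    return base_lr * (batch / base_batch) ** 0.5

# e.g. going from 1 device to 8 devices, per-device batch size 4
effective_batch = 8 * 4
print(linear_scaled_lr(2e-4, base_batch=4, batch=effective_batch))  # 1.6e-03
print(sqrt_scaled_lr(2e-4, base_batch=4, batch=effective_batch))    # ~5.66e-04
```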
for gradient accumulation, huggingface was building that into accelerate just as I started using it, and when i last tried it, it had a few rough edges. i'll give it another try with a new GAN project, and if it works well, redo the code here. just being cautious
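for anyone wanting to try it in the meantime, accelerate's built-in gradient accumulation looks roughly like this. the toy model, optimizer, and dataloader are placeholders, not `SoundStreamTrainer` internals:

```python
# sketch of huggingface accelerate's built-in gradient accumulation;
# the model / optimizer / dataloader below are toy stand-ins
import torch
from accelerate import Accelerator

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(64, 16), batch_size=4)

accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    # inside this context, accelerate skips gradient sync and defers
    # the real optimizer step until 4 micro-batches have accumulated
    with accelerator.accumulate(model):
        loss = model(batch).pow(2).mean()  # dummy loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```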