Hello,
I'm starting to train VoiceCraft on a custom dataset. I have a different hardware setup (L4 GPUs instead of A40s), so I'm adjusting the training configuration.
I noticed that you use an unusually large number of gradient accumulation steps (12), and when you backpropagate, it looks like the loss is not averaged over the accumulation steps. https://github.com/jasonppy/VoiceCraft/blob/4873249ba3d6328ed60c652ef577c3c19d0a6b9a/steps/trainer.py#L87-L91 https://github.com/jasonppy/VoiceCraft/blob/4873249ba3d6328ed60c652ef577c3c19d0a6b9a/steps/trainer.py#L138-L141
Does this mean the backpropagated loss becomes proportional to the number of gradient accumulation steps? Say you are currently doing 12 steps on an A40 GPU with 48 GB of memory. Since I use L4 GPUs with 24 GB of memory, I need to halve the per-step batch size and increase the gradient accumulation steps, which would be equivalent to dropping the LR by 2.
Alternatively, I've been reworking the dynamic sampler, and I can now fit training with 20000 audio tokens on 8 L4 GPUs in 4 accumulation steps instead of 12. If the loss is summed rather than averaged, the accumulated gradient from 4 steps is roughly a third of the one from 12 steps over the same data, so if I don't adjust the LR it effectively drops to 1/3.
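To make sure I'm reading the trainer right, here is a minimal toy sketch (a dummy linear model, not the actual VoiceCraft code) of the behaviour I mean: if per-micro-batch mean losses are summed across accumulation steps, the accumulated gradient over the same total data grows with the step count, whereas dividing each loss by the step count removes that dependence.

```python
# Toy sketch (not VoiceCraft code): with the loss summed over micro-batches,
# the accumulated gradient scales with the number of accumulation steps.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
data = torch.randn(12, 4)      # same total data in every run
target = torch.randn(12, 1)

def grad_norm(accum_steps, average=False):
    model.zero_grad()
    for x, y in zip(data.chunk(accum_steps), target.chunk(accum_steps)):
        loss = torch.nn.functional.mse_loss(model(x), y)  # mean loss per micro-batch
        if average:
            loss = loss / accum_steps  # what averaging over accumulation steps would do
        loss.backward()                # gradients from each micro-batch add up
    return model.weight.grad.norm().item()

print(grad_norm(12))                 # baseline: 12 summed micro-batch losses
print(grad_norm(4))                  # ~1/3 of the above, same total data
print(grad_norm(12, average=True))   # averaging removes the dependence on step count
```

If this matches what trainer.py does, then changing the number of accumulation steps without touching the optimizer LR effectively rescales the update, which is the source of my question.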
What do you think?
Thanks