Hello,
I'm starting to train VoiceCraft on a custom dataset. I have a different hardware setup (L4 GPUs instead of A40s), so I'm adjusting the training configuration.
I noticed that you use an unusually large number of gradient accumulation steps (12), and when you backpropagate, it looks like the loss is not averaged over the accumulation steps. https://github.com/jasonppy/VoiceCraft/blob/4873249ba3d6328ed60c652ef577c3c19d0a6b9a/steps/trainer.py#L87-L91 https://github.com/jasonppy/VoiceCraft/blob/4873249ba3d6328ed60c652ef577c3c19d0a6b9a/steps/trainer.py#L138-L141
Does this mean the backpropagated loss becomes proportional to the number of gradient accumulation steps? Say you are currently doing 12 steps on an A40 GPU with 48 GB of memory. Since I use L4 GPUs with 24 GB of memory, I need to halve the per-step batch size and increase the gradient accumulation steps, which would be equivalent to dropping the LR by 2.
Alternatively, I've been reworking the dynamic sampler, and I can now fit training with 20000 audio tokens on 8 L4 GPUs in 4 accumulation steps instead of 12. If the loss is summed rather than averaged, the accumulated gradient from 4 steps is roughly a third of the one from 12 steps over the same data, so if I don't adjust the LR it effectively drops to 1/3.
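To make sure I'm reading the trainer right, here is a minimal toy sketch (a dummy linear model, not the actual VoiceCraft code) of the behaviour I mean: if per-micro-batch mean losses are summed across accumulation steps, the accumulated gradient over the same total data grows with the step count, whereas dividing each loss by the step count removes that dependence.

```python
# Toy sketch (not VoiceCraft code): with the loss summed over micro-batches,
# the accumulated gradient scales with the number of accumulation steps.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
data = torch.randn(12, 4)      # same total data in every run
target = torch.randn(12, 1)

def grad_norm(accum_steps, average=False):
    model.zero_grad()
    for x, y in zip(data.chunk(accum_steps), target.chunk(accum_steps)):
        loss = torch.nn.functional.mse_loss(model(x), y)  # mean loss per micro-batch
        if average:
            loss = loss / accum_steps  # what averaging over accumulation steps would do
        loss.backward()                # gradients from each micro-batch add up
    return model.weight.grad.norm().item()

print(grad_norm(12))                 # baseline: 12 summed micro-batch losses
print(grad_norm(4))                  # ~1/3 of the above, same total data
print(grad_norm(12, average=True))   # averaging removes the dependence on step count
```

If this matches what trainer.py does, then changing the number of accumulation steps without touching the optimizer LR effectively rescales the update, which is the source of my question.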
What do you think?
Thanks