Closed: Vectorrent closed this issue 1 year ago with the following comment.

I think I understand now. The warmup steps are a bit like revving the engine in your car; they're not about training the model at all. They're about encouraging the user to create a dedicated warmup step, where you bring a GPU into "parallel processing" mode and synchronize with the weights you're about to feed it. This is a quantum process, of course, and should be run in a dedicated step before training ever begins. This probably doesn't make sense to you, but it makes sense to me, so I'll just close the issue for now.

Original issue:
When training a model with both gradient_accumulation_steps and warmup_steps enabled, the network is completely unable to learn. For example, when using:
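The original settings weren't preserved in this report, but a hypothetical configuration with both knobs enabled might look like this (the parameter names are the ones mentioned above; the values are placeholders):

```python
# Hypothetical placeholder values -- the report's actual settings were
# not preserved. Both knobs are enabled at once.
training_args = dict(
    gradient_accumulation_steps=4,  # average gradients over 4 batches per optimizer step
    warmup_steps=500,               # ramp the learning rate up over the first 500 steps
)
```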
The model fails to learn a single thing. However, if you set:
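Again with hypothetical placeholder values, disabling either knob, for example leaving accumulation at 1, is the kind of change meant here:

```python
# Hypothetical placeholder values: gradient accumulation disabled
# (every batch produces an optimizer step), warmup left on.
training_args = dict(
    gradient_accumulation_steps=1,  # no accumulation
    warmup_steps=500,
)
```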
The model's behavior looks quite different; it actually learns. I looked at the code, and both parameters appear to be passed straight through to PyTorch Lightning, so maybe it's an upstream issue?
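For context, here is a minimal sketch of how these two settings typically meet in PyTorch Lightning. Everything in it is an assumption for illustration (the class name, the values, and the step-counting pitfall noted in the comments); it is not code from this repository or a confirmed diagnosis:

```python
import torch
import pytorch_lightning as pl

# Minimal, self-contained sketch; names and numbers are hypothetical.
class TinyModule(pl.LightningModule):
    def __init__(self, warmup_steps: int = 500):
        super().__init__()
        self.warmup_steps = warmup_steps
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        opt = torch.optim.AdamW(self.parameters(), lr=1e-3)
        # Linear warmup measured in *optimizer* steps. With
        # accumulate_grad_batches=4 below, one optimizer step consumes
        # four batches, so any code that counts warmup in batches rather
        # than optimizer steps is off by the accumulation factor, and the
        # learning rate can stay near zero far longer than intended.
        sched = torch.optim.lr_scheduler.LambdaLR(
            opt, lambda step: min(1.0, (step + 1) / self.warmup_steps)
        )
        return {
            "optimizer": opt,
            "lr_scheduler": {"scheduler": sched, "interval": "step"},
        }

trainer = pl.Trainer(accumulate_grad_batches=4, max_steps=1_000)
# trainer.fit(TinyModule(), train_dataloaders=...)  # dataloader omitted
```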
Anyway, logging the issue for posterity.