konstmish / prodigy

The Prodigy optimizer and its variants for training neural networks.
MIT License

Document incompatibility with gradient clipping #13

crypdick opened this issue 6 months ago (status: Open)

crypdick commented 6 months ago

Hello, thanks for the great optimizer. I think the README should warn users not to use gradient clipping. During a postmortem on a failed model training, we realized that gradient clipping was confusing the Prodigy optimizer, causing it to use inappropriately high learning rates. I didn't figure it out until I examined Prodigy's internals for clues.
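For concreteness, here is a minimal sketch of the kind of training loop this warning is about, assuming the `prodigyopt` package; the model, data, and clipping threshold are placeholders, not the actual run:

```python
# Sketch of Prodigy combined with gradient-norm clipping, on a toy model
# with random data (placeholders for the real setup).
import torch
from prodigyopt import Prodigy  # assumes the prodigyopt package is installed

model = torch.nn.Linear(10, 2)                   # toy stand-in for the real model
optimizer = Prodigy(model.parameters(), lr=1.0)  # lr=1.0 is Prodigy's default
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(100):
    inputs = torch.randn(32, 10)
    targets = torch.randint(0, 2, (32,))
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # The clipping call below is the part this issue warns about: rescaling
    # gradients before optimizer.step() can mislead Prodigy's internal
    # step-size estimate.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
```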

konstmish commented 6 months ago

Hi, thanks for stopping by to share the feedback! Were you clipping the gradients by norm (as opposed to value clipping)? We've used gradient clipping with a max norm of 1 and Prodigy seemed to work fine, so the specific reason it didn't work well in your case is unclear to me. Can you share more information about your setting?
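For reference, a small standalone PyTorch illustration of the two clipping modes being distinguished here (toy parameters, not from the run in question):

```python
import torch

# Toy parameters with hand-set gradients, just to show the two clipping calls.
params = [torch.nn.Parameter(torch.randn(5)) for _ in range(3)]
for p in params:
    p.grad = torch.randn(5) * 10.0  # pretend these came from a backward pass

# Norm clipping: rescales all gradients together so their global L2 norm is
# at most max_norm; the gradient direction is preserved, only its scale changes.
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)

# Value clipping: clamps each gradient element independently into
# [-clip_value, clip_value], which can change the gradient direction.
torch.nn.utils.clip_grad_value_(params, clip_value=1.0)
```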

crypdick commented 6 months ago

Hmm, interesting! Yes, we were clipping by norm with a max value of 5. We were smoke-testing a new library using a vanilla resnet50 with the optimizer swapped out for Prodigy.

When we trained for 1 epoch with Prodigy, the gradient norms were pinned at their max value of 5, and the dlr kept climbing steadily until we started getting spikes in the loss and infs in the gradient norm. I happened to have put a screenshot in my experiment notebook, here:

[screenshot: dlr over mini-batches during the clipped run]

The x-axis is the mini-batch count. The legend says lr, but it is actually the dlr.

When we disabled gradient clipping and repeated the test run (keeping everything else equal), the dlr behaved totally differently: it initially ramped up to an equilibrium point, with occasional steps up. Overall, the dlr plateaued at a value 20x lower than with grad clipping enabled, and we didn't see numerical instability. Unfortunately I did not put a screenshot in my experiment log.
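For anyone wanting to log a dlr curve like the one above: one way is to read Prodigy's current distance estimate from the optimizer's param groups. This relies on an internal field name (`d`), so treat it as an assumption to verify against your installed version:

```python
import torch
from prodigyopt import Prodigy  # assumes the prodigyopt package is installed

model = torch.nn.Linear(10, 2)
optimizer = Prodigy(model.parameters(), lr=1.0)

def current_dlr(optimizer):
    # Assumes Prodigy stores its current distance estimate under the key 'd'
    # in each param group (an internal detail, not a documented API).
    group = optimizer.param_groups[0]
    return group["d"] * group["lr"]

# Calling this after each optimizer.step() gives the value to plot per mini-batch.
print(current_dlr(optimizer))
```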