peterjc123 opened 2 months ago
Similar issue here - when I put `grads = gradfilter_ema(model, grads)` after the call to `scaler.unscale_(optimizer)`, the scale goes to 0 and I get NaNs for the step loss.
Thank you for the valuable report! This is likely because of the increased gradient norm caused by the added low-pass-filtered gradient.
The code here is basically a proof-of-concept demonstration of accelerating grokking in the previously known scenarios. For larger models, I suspect there needs to be more sophisticated control of the step size of the gradient updates, especially with the mixed precision training you mentioned. I will revise the code to add compatibility for training larger models in the next version.
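As a concrete (untested) illustration of what such step-size control could look like, one option is to clip the gradient norm after the filter runs on the unscaled gradients; the `max_norm=1.0` below is an arbitrary placeholder, and `model`, `loss`, `scaler`, and `grads` are assumed to come from the surrounding training loop:

```python
# Sketch: bound the update size by clipping after Grokfast filtering.
# Assumes gradients have already been unscaled to full magnitude.
scaler.scale(loss).backward()
scaler.unscale_(optimizer)                    # gradients at true magnitude
grads = gradfilter_ema(model, grads)          # adds the low-pass-filtered term
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)                        # skips the step if unscale_ found inf/NaN
scaler.update()
```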
Hi, I'm trying out Grokfast in an LLM scenario. Mixed precision training is a commonly used technique to save GPU memory and speed up training. The following code is an example of an FP16 training step.
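(A minimal sketch of the standard `torch.cuda.amp` pattern; `model`, `optimizer`, and `dataloader` are assumed to be defined elsewhere.)

```python
import torch

scaler = torch.cuda.amp.GradScaler()
grads = None  # Grokfast EMA state, carried across steps

for inputs, targets in dataloader:
    optimizer.zero_grad()
    # Forward pass runs in FP16 under autocast; model returning the loss
    # directly is an assumption for brevity.
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(inputs, targets)
    # Backward pass produces gradients multiplied by the loss scale.
    scaler.scale(loss).backward()
    # Gradients are divided back to their true magnitude here.
    scaler.unscale_(optimizer)
    # grads = gradfilter_ema(model, grads)  # <- where does this belong?
    scaler.step(optimizer)  # skips the step if unscale_ found inf/NaN gradients
    scaler.update()         # adjusts the loss scale for the next iteration
```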
The question is: where should I put `grads = gradfilter_ema(model, grads)`? I tried to put it between `scale` and `unscale`, but that doesn't work; the loss scale just explodes.