I’ve found that setting max_grad_norm has no effect: we are not clipping gradients.
To verify, I ran convergence with max_grad_norm set to 1e-9 and saw no difference in eval loss. I also checked unscale_and_clip_grads and found that self.clip_grad is 0 when I printed it here.
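For context, here is a minimal sketch of why a zero clip_grad silently disables clipping. This assumes a DeepSpeed-style FP16 optimizer; the class name, constructor, and exact scaling math below are illustrative, only unscale_and_clip_grads and self.clip_grad come from the observation above.

```python
import torch

class FP16OptimizerSketch:
    """Illustrative only: shows why clip_grad == 0 disables clipping entirely."""

    def __init__(self, clip_grad=0.0, cur_scale=1.0):
        # If max_grad_norm is never propagated, clip_grad stays at its 0.0 default.
        self.clip_grad = clip_grad
        self.cur_scale = cur_scale

    def unscale_and_clip_grads(self, grad_groups_flat, total_norm):
        combined_scale = self.cur_scale
        # The clipping branch is guarded by clip_grad > 0, so with clip_grad == 0
        # the gradients are only unscaled and never clipped, regardless of how
        # small max_grad_norm was set in the config.
        if self.clip_grad > 0.0:
            clip = ((total_norm / self.cur_scale) + 1e-6) / self.clip_grad
            if clip > 1.0:
                combined_scale = clip * self.cur_scale
        for grad in grad_groups_flat:
            grad.data.mul_(1.0 / combined_scale)


# Quick check: with clip_grad == 0, even a huge total_norm leaves grads unclipped.
opt = FP16OptimizerSketch(clip_grad=0.0)
g = torch.ones(4) * 1e6
opt.unscale_and_clip_grads([g], total_norm=torch.tensor(1e6))
print(g)  # still 1e6 everywhere -> no clipping happened
```

This matches the symptom reported above: changing max_grad_norm (even to 1e-9) cannot change eval loss if the value never reaches clip_grad.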
Discussed in the Training WG (3/28): @itayhubara is verifying whether setting this value correctly affects convergence, and whether it can improve convergence or reduce the coefficient of variation in RCPs.