harubaru / waifu-diffusion

stable diffusion finetuned on weeb stuff

clip_grad_norm applied to scaled gradients #64

Open fpgaminer opened 1 year ago

fpgaminer commented 1 year ago

On this line, grad clipping occurs:

https://github.com/harubaru/waifu-diffusion/blob/27d301c5b96834536166cc2f12e7a9bb4079fb96/trainer/diffusers_trainer.py#L931

However, if fp16 is enabled, the clipping is applied to the scaled gradients because of GradScaler.

According to PyTorch documentation (https://pytorch.org/docs/master/notes/amp_examples.html#gradient-clipping), the gradients should be unscaled before clipping.
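For reference, a minimal sketch of the pattern the PyTorch AMP docs recommend, with unscaling before clipping. The model, optimizer, and data here are toy placeholders, not the trainer's actual objects:

```python
import torch

# Toy setup -- placeholders, not the trainer's actual objects.
model = torch.nn.Linear(16, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(8, 16, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(x).pow(2).mean()

    # Backward pass produces scaled gradients.
    scaler.scale(loss).backward()

    # Unscale in place first, so the clip threshold applies to the true gradients.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # scaler.step() knows the gradients were already unscaled and skips
    # the optimizer step if any of them are inf/NaN.
    scaler.step(optimizer)
    scaler.update()
```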

So, this appears to be a bug, and it could cause fp16 training to perform worse than it otherwise would.