Thanks for the work!
I'd like to know whether the SAM optimizer is also applicable to accelerated training, i.e. training with automatic mixed precision (fp16). I tried to adopt SAM in my own PyTorch training code with fp16, but the loss becomes NaN and the computed grad norm is NaN. Regular training with SGD gives no error. So I'm wondering whether this is caused by an error in the PyTorch reimplementation or by a limitation of SAM itself.
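For context, here is roughly the pattern I'm attempting, written as a minimal sketch rather than my actual code. `model`, `loader`, `criterion`, and the SAM wrapper with `first_step`/`second_step` (as in the common PyTorch reimplementations) are placeholders, not code from this repo. The idea is to unscale the gradients before `first_step`, so SAM's grad norm is not computed on loss-scaled values, and to skip the iteration when the gradients are non-finite, which `GradScaler` would otherwise catch inside `scaler.step()`:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Sketch only: `model`, `loader`, `criterion`, and a SAM wrapper exposing
# first_step()/second_step() (as in common PyTorch reimplementations) are
# placeholders, not code from this repo.
scaler = GradScaler()


def grads_are_finite(optimizer):
    """Return True if every (already unscaled) gradient is free of inf/NaN."""
    for group in optimizer.param_groups:
        for p in group["params"]:
            if p.grad is not None and not torch.isfinite(p.grad).all():
                return False
    return True


for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()

    # First forward/backward at the current weights.
    with autocast():
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()

    # Unscale before SAM computes its grad norm, so the perturbation is based
    # on the true gradients and we can catch the inf/NaN that scaler.step()
    # would normally detect and skip.
    scaler.unscale_(optimizer)

    if grads_are_finite(optimizer):
        optimizer.first_step(zero_grad=True)  # move to the perturbed point w + e(w)
        scaler.update()  # reset the scaler state so unscale_ can be called again

        # Second forward/backward at the perturbed weights.
        with autocast():
            loss = criterion(model(inputs), targets)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        optimizer.second_step(zero_grad=True)  # restore weights, apply base update
        scaler.update()
    else:
        # Overflow this iteration: skip the SAM update and let the scaler back off.
        optimizer.zero_grad()
        scaler.update()
```

I realize a more careful version would also check the second pass for non-finite gradients before calling `second_step`; the sketch just shows where I think the unscaling has to happen. Is this the intended way to combine SAM with `GradScaler`, or is there a more fundamental limitation with fp16?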