Thanks for the work!
I'd like to know whether the SAM optimizer is also applicable to accelerated training, i.e. training with automatic mixed precision (fp16). I tried to adopt SAM in my own PyTorch training code with fp16, but the loss becomes NaN and the computed grad norm is NaN. Regular training with SGD gives no error. So I'm wondering whether this is caused by an error in the PyTorch reimplementation or by a limitation of SAM itself.
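For context, here is roughly the pattern I'm attempting, written as a minimal sketch rather than my actual code. `model`, `loader`, `criterion`, and the SAM wrapper with `first_step`/`second_step` (as in the common PyTorch reimplementations) are placeholders, not code from this repo. The idea is to unscale the gradients before `first_step`, so SAM's grad norm is not computed on loss-scaled values, and to skip the iteration when the gradients are non-finite, which `GradScaler` would otherwise catch inside `scaler.step()`:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Sketch only: `model`, `loader`, `criterion`, and a SAM wrapper exposing
# first_step()/second_step() (as in common PyTorch reimplementations) are
# placeholders, not code from this repo.
scaler = GradScaler()


def grads_are_finite(optimizer):
    """Return True if every (already unscaled) gradient is free of inf/NaN."""
    for group in optimizer.param_groups:
        for p in group["params"]:
            if p.grad is not None and not torch.isfinite(p.grad).all():
                return False
    return True


for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()

    # First forward/backward at the current weights.
    with autocast():
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()

    # Unscale before SAM computes its grad norm, so the perturbation is based
    # on the true gradients and we can catch the inf/NaN that scaler.step()
    # would normally detect and skip.
    scaler.unscale_(optimizer)

    if grads_are_finite(optimizer):
        optimizer.first_step(zero_grad=True)  # move to the perturbed point w + e(w)
        scaler.update()  # reset the scaler state so unscale_ can be called again

        # Second forward/backward at the perturbed weights.
        with autocast():
            loss = criterion(model(inputs), targets)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        optimizer.second_step(zero_grad=True)  # restore weights, apply base update
        scaler.update()
    else:
        # Overflow this iteration: skip the SAM update and let the scaler back off.
        optimizer.zero_grad()
        scaler.update()
```

I realize a more careful version would also check the second pass for non-finite gradients before calling `second_step`; the sketch just shows where I think the unscaling has to happen. Is this the intended way to combine SAM with `GradScaler`, or is there a more fundamental limitation with fp16?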