Open CacacaLalala opened 9 months ago
Hi, a NaN loss might be caused by the training data or the learning rate, among other things. You should check your data, or perhaps lower your learning rate.
Hi. Increasing adam_epsilon may help. The current value of adam_epsilon is 1e-8, which may be too small for fp16.
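The suggestion above can be demonstrated numerically. As a minimal sketch (using NumPy's float16 to stand in for fp16 training, not the actual model code from this repo): the smallest positive float16 subnormal is 2**-24 ≈ 5.96e-8, so an epsilon of 1e-8 underflows to exactly zero when cast to fp16, and Adam's denominator sqrt(v) + eps can hit zero when the second-moment estimate v is tiny, yielding inf/NaN updates.

```python
import numpy as np

# eps = 1e-8 is below float16's smallest subnormal (2**-24 ≈ 5.96e-8),
# so it underflows to exactly 0.0 in fp16.
eps_fp16 = np.float16(1e-8)
print(eps_fp16)  # → 0.0

# Adam-style update g / (sqrt(v) + eps): with eps gone and a tiny
# second-moment estimate v, the denominator is 0 and the update blows up.
g = np.float16(1e-3)
v = np.float16(0.0)
with np.errstate(divide="ignore"):
    update = g / (np.sqrt(v) + eps_fp16)
print(update)  # inf — this is where the loss turns NaN

# A larger eps such as 1e-4 survives the cast to fp16 and keeps the
# denominator, and hence the update, finite.
eps_safe = np.float16(1e-4)
print(g / (np.sqrt(v) + eps_safe))  # finite, roughly 10
```

This is why bumping adam_epsilon to something representable in fp16 (e.g. 1e-4) can stop the NaNs without moving the whole model to FP32.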
Hi, Thanks a lot for this repo!
I found this problem when I tried to train the model. If the model precision is FP16, the loss becomes NaN, but FP32 fixes the problem. However, we train on V100 GPUs, and FP32 causes out-of-memory errors. Any solution here?