Open CacacaLalala opened 9 months ago
Hi, a NaN loss might be caused by the training data or the learning rate, among other things. You should check your data, or perhaps lower your learning rate.
Hi. Increasing adam_epsilon may help. The current value of adam_epsilon is 1e-8, which may be too small for fp16.
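The suggestion above can be demonstrated numerically. As a minimal sketch (using NumPy's float16 to stand in for fp16 training, not the actual model code from this repo): the smallest positive float16 subnormal is 2**-24 ≈ 5.96e-8, so an epsilon of 1e-8 underflows to exactly zero when cast to fp16, and Adam's denominator sqrt(v) + eps can hit zero when the second-moment estimate v is tiny, yielding inf/NaN updates.

```python
import numpy as np

# eps = 1e-8 is below float16's smallest subnormal (2**-24 ≈ 5.96e-8),
# so it underflows to exactly 0.0 in fp16.
eps_fp16 = np.float16(1e-8)
print(eps_fp16)  # → 0.0

# Adam-style update g / (sqrt(v) + eps): with eps gone and a tiny
# second-moment estimate v, the denominator is 0 and the update blows up.
g = np.float16(1e-3)
v = np.float16(0.0)
with np.errstate(divide="ignore"):
    update = g / (np.sqrt(v) + eps_fp16)
print(update)  # inf — this is where the loss turns NaN

# A larger eps such as 1e-4 survives the cast to fp16 and keeps the
# denominator, and hence the update, finite.
eps_safe = np.float16(1e-4)
print(g / (np.sqrt(v) + eps_safe))  # finite, roughly 10
```

This is why bumping adam_epsilon to something representable in fp16 (e.g. 1e-4) can stop the NaNs without moving the whole model to FP32.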
Hi, Thanks a lot for this repo!
I found this problem when I tried to train the model. If the model precision is FP16, the loss becomes NaN, but FP32 fixes the problem. However, we train on V100 GPUs, and FP32 causes out-of-memory errors. Any solution here?