hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0
32.68k stars, 4.01k forks

grad_norm is NaN and loss is 0 #5627

Open lambda-lee opened 1 week ago

lambda-lee commented 1 week ago

Reminder

System Info

Reproduction

After training for a while, this log line appears first:
{'loss': 1.4596, 'grad_norm': nan, 'learning_rate': 4.552255167404752e-06, 'epoch': 0.66}
After that, every step logs the same thing:
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.549912997495027e-06, 'epoch': 0.66}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.549912997495027e-06, 'epoch': 0.66}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.549912997495027e-06, 'epoch': 0.66}

Expected behavior

How can I make training skip the data that triggers the error and keep going? Thanks.
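For illustration only, here is a framework-free sketch of the "skip the bad batch and continue" idea. It is not LLaMA-Factory's code and does not use any LLaMA-Factory option; the function name `train_skipping_bad_batches` and the toy one-parameter model are made up for the example. The point it shows is that the finiteness check has to run before the weight update, because once a NaN gradient has been applied to the weights, skipping later batches cannot bring them back.

```python
import math


def train_skipping_bad_batches(batches, w=0.0, lr=0.05):
    """Fit y = w * x with plain SGD, skipping batches that produce a
    non-finite loss or gradient. Hypothetical helper, toy model only."""
    skipped = 0
    for xs, ys in batches:
        n = len(xs)
        # Mean squared error and its gradient with respect to w.
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # Check BEFORE updating: a NaN/inf gradient would poison w for good.
        if not (math.isfinite(loss) and math.isfinite(grad)):
            skipped += 1
            continue
        w -= lr * grad
    return w, skipped


batches = [
    ([1.0, 2.0], [2.0, 4.0]),
    ([1.0, float("nan")], [2.0, 4.0]),  # corrupted sample, gets skipped
    ([3.0], [6.0]),
]
w, skipped = train_skipping_bad_batches(batches)
print(f"w={w}, skipped={skipped}")
```

In a PyTorch loop the analogous check would be on the loss tensor and on the total norm returned by `torch.nn.utils.clip_grad_norm_`, calling `optimizer.zero_grad()` instead of `optimizer.step()` when either is non-finite. Whether that can be wired into this project's trainer is not something this thread confirms.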

Others

No response

hiyouga commented 1 week ago

Could you try a different model?

Eternal-Yan commented 1 day ago

I'm running into the same thing. Did you manage to solve it?