Open rwightman opened 10 months ago
Hello Ross,
Thank you for sharing your findings!
I also have similar findings: the q/k/v matmuls and the division need to be in float32 during training to avoid NaN loss. We currently do not have a good remedy for this. Given that the q/k/v matmuls and the division are lightweight, your current approach is an excellent workaround. We will certainly delve further into this matter and keep you updated once we identify an effective solution.
Regarding the evaluation stability, I am not sure whether changing the eps to 1e-5 will hurt accuracy. If possible, keeping the division in float32 during testing is the better solution, since its computation cost is negligible.
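To illustrate the eps concern: a very small epsilon (on the order of 1e-15; the exact default is an assumption here) underflows to exactly zero in float16, so it no longer guards the division under mixed precision, whereas 1e-5 survives the cast:

```python
import torch

# A tiny eps such as 1e-15 underflows to exactly 0.0 in float16
# (the smallest float16 subnormal is about 6e-8), so it no longer
# protects the denominator; 1e-5 is still representable.
tiny_eps = torch.tensor(1e-15).half()
safe_eps = torch.tensor(1e-5).half()
print(tiny_eps.item())  # 0.0
print(safe_eps.item())
```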
Thank you, Han
Hello, a contributor recently added EfficientViT to timm, so I explored the model before merging... I found that it could not train in mixed precision without instantly producing NaN loss. The problem appears to be the q/k/v matmuls and the division. Have you observed anything similar, or thought of any approaches to improve this?
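For reference, a minimal sketch of the workaround being discussed, assuming an EfficientViT-style ReLU linear attention over tensors of shape (batch, tokens, dim); the function name, shapes, and eps value are illustrative, not the actual timm or EfficientViT implementation:

```python
import torch

def relu_linear_attention_fp32(q, k, v, eps=1e-5):
    # Hypothetical sketch: force the q/k/v matmuls and the final division
    # to run in float32 even when the surrounding model uses autocast.
    with torch.autocast(device_type=q.device.type, enabled=False):
        q = torch.relu(q).float()                    # (B, N, D)
        k = torch.relu(k).float()                    # (B, N, D)
        v = v.float()                                # (B, N, D)
        kv = k.transpose(-2, -1) @ v                 # (B, D, D) context matrix
        numerator = q @ kv                           # (B, N, D)
        # Denominator: q dotted with the sum of keys over the token axis.
        denominator = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps
        return numerator / denominator               # (B, N, D), float32
```

Disabling autocast for just this block keeps the expensive projections and MLPs in reduced precision while the numerically fragile steps stay in float32.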