I am trying to fully fine-tune deepseek-vl on multiple nodes with DeepSpeed ZeRO-2. The loss and accuracy look fine at first (the number of normal steps varies with how many nodes are used), but after a certain step both the training loss and accuracy drop to zero without warning. Training keeps running even with the loss at zero, and evaluation then returns a NaN loss.
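To narrow down which step goes bad, one thing I can add is a guard that stops training the moment the loss becomes zero or non-finite, before it silently propagates into eval. This is just a minimal sketch (the `check_finite` helper and its wiring into the loop are illustrative, not from my actual setup):

```python
import math

def check_finite(name: str, value: float) -> float:
    """Raise as soon as a training statistic collapses to zero or NaN/inf,
    so the offending step can be inspected instead of continuing silently."""
    if value == 0.0 or not math.isfinite(value):
        raise RuntimeError(f"{name} became {value!r}; dump the batch and gradients at this step")
    return value

# Example: a healthy loss passes through unchanged, a NaN loss aborts.
loss = check_finite("loss", 0.73)
```

In a real loop this would wrap `loss.item()` right after the forward pass on each rank, so the failing step (and its data shard) can be identified.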
Do you have any idea what the problem might be? Thanks.