microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

LayoutLM NaN Loss while Training #622

Open sathwikacharya opened 2 years ago

sathwikacharya commented 2 years ago

Hey, I am having an issue where the loss output by the model during training is NaN. This usually happens after the 3rd epoch. I am training on a custom dataset with 29 classes and 40,000 data points. The steps followed are identical to those in this notebook, except for a few tweaks: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb

The training is being done on an AWS SageMaker notebook instance. The accelerate API (the notebook_launcher() function, to be more precise) is also used to train on multiple GPUs. Moreover, the logits for all test predictions are also NaN.

Any help to this is much appreciated.

Thank you
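As a starting point (not from the thread itself), the sketch below shows two checks that commonly explain NaN losses when fine-tuning LayoutLMv2 for sequence classification: labels falling outside the range expected by the classification head, and bounding boxes outside the 0-1000 range expected by the 2-D position embeddings, plus gradient clipping as a guard against exploding gradients after a few epochs. The `check_batch` and `training_step` helpers and the batch layout are assumptions based on the linked tutorial, not code from this issue.

```python
import torch
from transformers import LayoutLMv2ForSequenceClassification

NUM_LABELS = 29  # from the issue description

model = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=NUM_LABELS
)

def check_batch(batch):
    # Labels must lie in [0, NUM_LABELS - 1]; anything outside that range
    # can make the cross-entropy loss non-finite or trigger CUDA asserts.
    labels = batch["labels"]
    assert labels.min() >= 0 and labels.max() < NUM_LABELS, (
        f"label out of range: {labels.min().item()}..{labels.max().item()}"
    )
    # LayoutLMv2 expects bbox coordinates normalized to [0, 1000].
    bbox = batch["bbox"]
    assert bbox.min() >= 0 and bbox.max() <= 1000, (
        f"bbox out of range: {bbox.min().item()}..{bbox.max().item()}"
    )

def training_step(batch, optimizer):
    # batch is assumed to contain input_ids, bbox, image, attention_mask,
    # and labels, as produced by the processor in the linked tutorial.
    check_batch(batch)
    outputs = model(**batch)
    loss = outputs.loss
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss encountered: {loss.item()}")
    loss.backward()
    # Clip gradients; exploding gradients combined with a high learning
    # rate are a frequent cause of NaN after a few epochs.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

If the checks pass and the loss still turns NaN, lowering the learning rate or disabling mixed precision (fp16) in the accelerate config are the usual next things to try.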

Rithsek99 commented 6 months ago

@sathwikacharya did you resolve this issue? I'm facing the same problem, especially after switching from microsoft/layoutlmv3-base to microsoft/layoutlmv3-large.
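One hedged suggestion (an assumption, not confirmed by the maintainers): larger checkpoints are often more sensitive to the fine-tuning learning rate, so a lower rate with warmup sometimes keeps the loss finite when moving from the base to the large model. The values below are illustrative placeholders.

```python
import torch
from transformers import (
    LayoutLMv3ForSequenceClassification,
    get_linear_schedule_with_warmup,
)

model = LayoutLMv3ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv3-large", num_labels=29
)

num_training_steps = 10_000  # placeholder; derive from your dataloader length
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # 10% warmup, illustrative
    num_training_steps=num_training_steps,
)
```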