microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

LayoutLM NaN Loss while Training #622

Open sathwikacharya opened 2 years ago

sathwikacharya commented 2 years ago

Hey, I am having an issue where the loss output by the model during training is NaN. This usually happens after the 3rd epoch. I am training on a custom dataset with 29 classes and 40,000 data points. The steps followed are identical to those in this notebook, except for a few tweaks: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb

The training is being done on an AWS SageMaker notebook instance. The accelerate API (the notebook_launcher() function, to be more precise) is also used to train on multiple GPUs. Moreover, the logits for all test predictions are also NaN.

Any help to this is much appreciated.

Thank you
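As a starting point (not from the thread itself), the sketch below shows two checks that commonly explain NaN losses when fine-tuning LayoutLMv2 for sequence classification: labels falling outside the range expected by the classification head, and bounding boxes outside the 0-1000 range expected by the 2-D position embeddings, plus gradient clipping as a guard against exploding gradients after a few epochs. The `check_batch` and `training_step` helpers and the batch layout are assumptions based on the linked tutorial, not code from this issue.

```python
import torch
from transformers import LayoutLMv2ForSequenceClassification

NUM_LABELS = 29  # from the issue description

model = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=NUM_LABELS
)

def check_batch(batch):
    # Labels must lie in [0, NUM_LABELS - 1]; anything outside that range
    # can make the cross-entropy loss non-finite or trigger CUDA asserts.
    labels = batch["labels"]
    assert labels.min() >= 0 and labels.max() < NUM_LABELS, (
        f"label out of range: {labels.min().item()}..{labels.max().item()}"
    )
    # LayoutLMv2 expects bbox coordinates normalized to [0, 1000].
    bbox = batch["bbox"]
    assert bbox.min() >= 0 and bbox.max() <= 1000, (
        f"bbox out of range: {bbox.min().item()}..{bbox.max().item()}"
    )

def training_step(batch, optimizer):
    # batch is assumed to contain input_ids, bbox, image, attention_mask,
    # and labels, as produced by the processor in the linked tutorial.
    check_batch(batch)
    outputs = model(**batch)
    loss = outputs.loss
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss encountered: {loss.item()}")
    loss.backward()
    # Clip gradients; exploding gradients combined with a high learning
    # rate are a frequent cause of NaN after a few epochs.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

If the checks pass and the loss still turns NaN, lowering the learning rate or disabling mixed precision (fp16) in the accelerate config are the usual next things to try.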

Rithsek99 commented 6 months ago

@sathwikacharya did you resolve this issue? I'm facing the same problem, especially after switching from microsoft/layoutlmv3-base to microsoft/layoutlmv3-large.
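One hedged suggestion (an assumption, not confirmed by the maintainers): larger checkpoints are often more sensitive to the fine-tuning learning rate, so a lower rate with warmup sometimes keeps the loss finite when moving from the base to the large model. The values below are illustrative placeholders.

```python
import torch
from transformers import (
    LayoutLMv3ForSequenceClassification,
    get_linear_schedule_with_warmup,
)

model = LayoutLMv3ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv3-large", num_labels=29
)

num_training_steps = 10_000  # placeholder; derive from your dataloader length
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # 10% warmup, illustrative
    num_training_steps=num_training_steps,
)
```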