microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

LayoutLMv2 nan training loss and eval #331

Open victor-ab opened 3 years ago

victor-ab commented 3 years ago

Describe the bug
The model I am using is LayoutLMv2 with a custom dataset. The problem arises when using the official run_funsd.py example script.

To Reproduce
Steps to reproduce the behavior:

run_funsd.py \
    --do_eval=True \
    --do_predict=True \
    --do_train=True \
    --early_stop_patience=4 \
    --evaluation_strategy=epoch \
    --fp16=True \
    --load_best_model_at_end=True \
    --max_train_samples=1000 \
    --model_name_or_path=microsoft/layoutlmv2-base-uncased \
    --num_train_epochs=30 \
    --output_dir=/tmp/test-ner \
    --overwrite_output_dir=True \
    --report_to=wandb \
    --save_strategy=epoch \
    --save_total_limit=1 \
    --warmup_ratio=0.1

Fortunately, I recorded everything with wandb.

[wandb charts: training/eval loss and F1 score over epochs]

After 8 epochs, the training and eval loss went to NaN and the F1 score dropped suddenly. The samples per second also increased significantly.

saksham-s commented 3 years ago

@victor-ab I am trying to use a dataset with sample annotations like https://github.com/doc-analysis/DocBank/blob/master/DocBank_samples/DocBank_samples/10.tar_1701.04170.gz_TPNL_afterglow_evo_8.txt. Are there any pointers on how to convert such a dataset to the format expected by the run_funsd.py script?
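Roughly, this is the kind of conversion I am picturing (just a sketch, not an official converter; I am assuming each annotation line is tab-separated with the token first, four bbox coordinates next, and the label last, and that run_funsd.py consumes FUNSD-style examples with tokens, bboxes and ner_tags):

def convert_page(txt_path, label2id):
    # hypothetical helper: map one annotation file to a FUNSD-style example
    tokens, bboxes, ner_tags = [], [], []
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if not cols or not cols[0]:
                continue
            token, label = cols[0], cols[-1]
            # adjust these indices to the actual DocBank column layout
            x0, y0, x1, y1 = (int(v) for v in cols[1:5])
            tokens.append(token)
            bboxes.append([x0, y0, x1, y1])  # LayoutLMv2 expects boxes normalized to 0-1000
            ner_tags.append(label2id[label])
    return {"tokens": tokens, "bboxes": bboxes, "ner_tags": ner_tags}

Does a label map plus something along these lines match what the script expects, or is more needed?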

xianshu1 commented 3 years ago

@victor-ab Have you figured out where the problem is? I also used the official example script run_funsd.py with my own custom dataset and got a NaN training loss. The run_funsd.py script runs perfectly well on the FUNSD dataset.

XueAdas commented 3 years ago

Same question here; maybe something is wrong with the spatial-aware self-attention.

kbrajwani commented 3 years ago

Hey, where is the option to provide a custom dataset path in the run_funsd.py file?

bkwapong commented 3 years ago

@victor-ab @xianshu1 @XueAdas I had a similar issue. I tried a lower learning_rate (0.00001) and it works for me now, but training takes quite a long time to bring the loss down to the level I want. I guess that is the price to pay.
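For reference, the change is just the learning rate; a sketch of the relevant TrainingArguments (5e-5 is the Trainer default; the other values mirror the command above):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/tmp/test-ner",
    learning_rate=1e-5,   # i.e. 0.00001, instead of the 5e-5 default
    num_train_epochs=30,
    fp16=True,
)

On the command line this is just --learning_rate=1e-5 added to the invocation above.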

jendrikjoe commented 3 years ago

I am seeing this behavior too, and after fighting it the whole day I wanted to share my progress. For me it originates in the sequence output of LayoutLMv2 containing NaN values when doing something like the following:

# forward pass through the LayoutLMv2 backbone; outputs[0] holds the text and
# image token embeddings concatenated along the sequence dimension
outputs = self.layoutlmv2(
    input_ids=input_ids,
    bbox=bbox,
    image=image,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    position_ids=position_ids,
    head_mask=head_mask,
    inputs_embeds=inputs_embeds,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
# split the concatenated output back into its text and image parts
seq_length = input_ids.size(1)
sequence_output, image_output = (
    outputs[0][:, :seq_length],
    outputs[0][:, seq_length:],
)
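A check like the following (just a debugging sketch, using the variable names from the snippet above) makes the failure visible at its source instead of downstream in the loss:

import torch

def assert_no_nan(name, tensor):
    # debugging helper: fail fast instead of letting NaNs propagate into the loss
    if torch.isnan(tensor).any():
        raise RuntimeError(f"NaN detected in {name}: {int(torch.isnan(tensor).sum())} values")

assert_no_nan("sequence_output", sequence_output)
assert_no_nan("image_output", image_output)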

The loss being NaN seems to be a secondary symptom 🤔

It still seems to be random, but I am trying to narrow it down right now.

So far I have excluded the following root causes:

As mentioned by @bkwapong, lowering the learning rate below 5e-5 seems to resolve the issue 😉 I hope this proves useful to some of you, and I will keep you posted if I find an actual solution 🙂

jendrikjoe commented 3 years ago

Okay, continuing my debugging process here: in LayoutLMv2ForTokenClassification the loss is calculated in the following manner:

loss = None
if labels is not None:
    # CrossEntropyLoss ignores every target equal to -100 (its default ignore_index)
    loss_fct = CrossEntropyLoss()

    if attention_mask is not None:
        # keep only the positions that are not padding
        active_loss = attention_mask.view(-1) == 1
        active_logits = logits.view(-1, self.num_labels)[active_loss]
        active_labels = labels.view(-1)[active_loss]
        loss = loss_fct(active_logits, active_labels)
    else:
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

The problem arises when active_labels contains many -100 values, which is the case if one sets only_label_first_subword in the tokenizer to True. CrossEntropyLoss ignores every position whose label is -100; it takes the remaining values and averages over them. Consider the extreme case where only one of the 512 labels is not -100: the "average" is then just that single value divided by one, which means the gradient is also divided by one. The gradient through that one active output is therefore roughly 512 times larger than expected. If, on the other hand, no label is -100, the average is the sum of all losses divided by 512, so the gradient is divided by 512 as well.
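A tiny standalone demo of that effect (just a sketch, not the model code): with the default mean reduction, CrossEntropyLoss averages only over the targets that are not -100, so the gradient reaching a single active position is not scaled down by the sequence length:

import torch
from torch.nn import CrossEntropyLoss

torch.manual_seed(0)
seq_len, num_labels = 512, 7
logits = torch.randn(seq_len, num_labels, requires_grad=True)

labels = torch.full((seq_len,), -100, dtype=torch.long)  # everything ignored...
labels[0] = 3                                            # ...except a single token

loss = CrossEntropyLoss()(logits, labels)  # default: mean over the 1 active target
loss.backward()
print(logits.grad[0].abs().max())  # roughly 512x larger than if all positions were labeled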

What solved the problem for me is to compute the sum of the losses and divide by the maximum number of labels:

loss_fct = CrossEntropyLoss(reduction="sum")
if attention_mask is not None:
    active_loss = attention_mask.view(-1) == 1
    active_logits = logits.view(-1, self.num_labels)[active_loss]
    active_labels = labels.view(-1)[active_loss]
    class_loss = loss_fct(active_logits, active_labels)
else:
    class_loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

# normalize by the total number of label positions, not the number of active ones
class_loss = class_loss / len(labels.view(-1))

This way the gradients are independent of the number of labels that are -100.
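To sanity-check the argument (again just a sketch): with the sum-then-divide-by-the-total normalization, the gradient at an active position has the same scale whether one or all 512 positions carry a label:

import torch
from torch.nn import CrossEntropyLoss

def normalized_loss(logits, labels, num_labels):
    loss_fct = CrossEntropyLoss(reduction="sum")
    class_loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))
    # divide by the total number of positions, not the number of active ones
    return class_loss / len(labels.view(-1))

torch.manual_seed(0)
seq_len, num_labels = 512, 7
labels_one = torch.full((seq_len,), -100, dtype=torch.long)
labels_one[0] = 3                                        # a single labeled token
labels_all = torch.randint(0, num_labels, (seq_len,))    # every token labeled

for labels in (labels_one, labels_all):
    logits = torch.randn(seq_len, num_labels, requires_grad=True)
    normalized_loss(logits, labels, num_labels).backward()
    print(logits.grad.abs().max())  # comparable magnitude in both cases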

I hope this helps others 🙂 If my assumptions are wrong, I would love some input 👍

sathwikacharya commented 2 years ago

Hey, I am having a similar issue where the logits output by the model after training are NaN. Any idea why this is happening? I am training on a custom dataset with 29 classes and 40,000 data points. The steps I followed are identical to this notebook, except for a few tweaks: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb

I do not know if this matters, but I am training on multiple GPUs using the Accelerate API from Hugging Face. Any help is much appreciated.

magataro commented 2 years ago

Perhaps there is no problem with the loss-calculation code. In my case, I got NaN values only when computing the loss under autocast(); once I stopped using AMP, I no longer got NaN values. I hope this is helpful to you.

NaN with AMP is a known issue. https://github.com/pytorch/pytorch/issues/40497
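If you want to rule AMP in or out, the simplest things to try (sketches, assuming a standard Trainer / PyTorch AMP setup) are dropping --fp16 from the command, or keeping AMP but forcing this model's forward pass to full precision:

import torch

# model and batch are placeholders for your own objects
with torch.cuda.amp.autocast(enabled=False):
    outputs = model(**batch)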
