microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

LayoutLMv2 nan training loss and eval #331

Open victor-ab opened 3 years ago

victor-ab commented 3 years ago

Describe the bug
The model I am using is LayoutLMv2 with a custom dataset. The problem arises when using the official run_funsd.py example script.

To Reproduce
Steps to reproduce the behavior:

run_funsd.py \
    --do_eval=True \
    --do_predict=True \
    --do_train=True \
    --early_stop_patience=4 \
    --evaluation_strategy=epoch \
    --fp16=True \
    --load_best_model_at_end=True \
    --max_train_samples=1000 \
    --model_name_or_path=microsoft/layoutlmv2-base-uncased \
    --num_train_epochs=30 \
    --output_dir=/tmp/test-ner \
    --overwrite_output_dir=True \
    --report_to=wandb \
    --save_strategy=epoch \
    --save_total_limit=1 \
    --warmup_ratio=0.1

Fortunately, I recorded everything with wandb.

[wandb charts: training/eval loss and F1 score over epochs]

After 8 epochs, the training and eval loss went to NaN and the F1 score dropped suddenly. The samples per second also increased significantly.

saksham-s commented 3 years ago

@victor-ab I am trying to use a dataset with sample annotations like https://github.com/doc-analysis/DocBank/blob/master/DocBank_samples/DocBank_samples/10.tar_1701.04170.gz_TPNL_afterglow_evo_8.txt. Are there any pointers on how to convert such a dataset to the format expected by the run_funsd.py script?
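Roughly, this is the kind of conversion I am picturing (just a sketch, not an official converter; I am assuming each annotation line is tab-separated with the token first, four bbox coordinates next, and the label last, and that run_funsd.py consumes FUNSD-style examples with tokens, bboxes and ner_tags):

def convert_page(txt_path, label2id):
    # hypothetical helper: map one annotation file to a FUNSD-style example
    tokens, bboxes, ner_tags = [], [], []
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if not cols or not cols[0]:
                continue
            token, label = cols[0], cols[-1]
            # adjust these indices to the actual DocBank column layout
            x0, y0, x1, y1 = (int(v) for v in cols[1:5])
            tokens.append(token)
            bboxes.append([x0, y0, x1, y1])  # LayoutLMv2 expects boxes normalized to 0-1000
            ner_tags.append(label2id[label])
    return {"tokens": tokens, "bboxes": bboxes, "ner_tags": ner_tags}

Does a label map plus something along these lines match what the script expects, or is more needed?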

xianshu1 commented 3 years ago

@victor-ab Have you figured out where the problem is? I also used the official example script run_funsd.py with my own custom dataset and got a NaN training loss. The run_funsd.py script runs perfectly well on the FUNSD dataset.

XueAdas commented 3 years ago

Same question here; maybe something is wrong with the spatial-aware self-attention.

kbrajwani commented 3 years ago

Hey, where is the option to provide a custom dataset path in the run_funsd.py file?

bkwapong commented 3 years ago

@victor-ab @xianshu1 @XueAdas I had a similar issue. I tried a lower learning_rate (0.00001) and it works for me now, but training takes quite a long time to bring the loss down to the level I want. I guess that is the price to pay.
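For reference, the change is just the learning rate; a sketch of the relevant TrainingArguments (5e-5 is the Trainer default; the other values mirror the command above):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/tmp/test-ner",
    learning_rate=1e-5,   # i.e. 0.00001, instead of the 5e-5 default
    num_train_epochs=30,
    fp16=True,
)

On the command line this is just --learning_rate=1e-5 added to the invocation above.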

jendrikjoe commented 3 years ago

I am seeing this behavior too, and after fighting it the whole day I wanted to share my progress. For me it originates in the sequence output of LayoutLMv2 containing NaN values when doing something like the following:

# forward pass through the LayoutLMv2 backbone; outputs[0] holds the text and
# image token embeddings concatenated along the sequence dimension
outputs = self.layoutlmv2(
    input_ids=input_ids,
    bbox=bbox,
    image=image,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    position_ids=position_ids,
    head_mask=head_mask,
    inputs_embeds=inputs_embeds,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
# split the concatenated output back into its text and image parts
seq_length = input_ids.size(1)
sequence_output, image_output = (
    outputs[0][:, :seq_length],
    outputs[0][:, seq_length:],
)
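A check like the following (just a debugging sketch, using the variable names from the snippet above) makes the failure visible at its source instead of downstream in the loss:

import torch

def assert_no_nan(name, tensor):
    # debugging helper: fail fast instead of letting NaNs propagate into the loss
    if torch.isnan(tensor).any():
        raise RuntimeError(f"NaN detected in {name}: {int(torch.isnan(tensor).sum())} values")

assert_no_nan("sequence_output", sequence_output)
assert_no_nan("image_output", image_output)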

The loss being NaN seems to be a secondary symptom 🤔

It still seems to be random, but I am trying to narrow it down right now.

So far I have excluded the following root causes:

As mentioned by @bkwapong, lowering the learning rate below 5e-5 seems to resolve the issue 😉 I hope this proves useful to some of you, and I will keep you posted if I find an actual solution 🙂

jendrikjoe commented 3 years ago

Okay, continuing my debugging process here: in LayoutLMv2ForTokenClassification the loss is calculated in the following manner:

loss = None
if labels is not None:
    # CrossEntropyLoss ignores every target equal to -100 (its default ignore_index)
    loss_fct = CrossEntropyLoss()

    if attention_mask is not None:
        # keep only the positions that are not padding
        active_loss = attention_mask.view(-1) == 1
        active_logits = logits.view(-1, self.num_labels)[active_loss]
        active_labels = labels.view(-1)[active_loss]
        loss = loss_fct(active_logits, active_labels)
    else:
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

The problem arises when active_labels contains many -100 values, which is the case if one sets only_label_first_subword in the tokenizer to True. CrossEntropyLoss ignores every position whose label is -100; it takes the remaining values and averages over them. Consider the extreme case where only one of the 512 labels is not -100: the "average" is then just that single value divided by one, which means the gradient is also divided by one. The gradient through that one active output is therefore roughly 512 times larger than expected. If, on the other hand, no label is -100, the average is the sum of all losses divided by 512, so the gradient is divided by 512 as well.
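A tiny standalone demo of that effect (just a sketch, not the model code): with the default mean reduction, CrossEntropyLoss averages only over the targets that are not -100, so the gradient reaching a single active position is not scaled down by the sequence length:

import torch
from torch.nn import CrossEntropyLoss

torch.manual_seed(0)
seq_len, num_labels = 512, 7
logits = torch.randn(seq_len, num_labels, requires_grad=True)

labels = torch.full((seq_len,), -100, dtype=torch.long)  # everything ignored...
labels[0] = 3                                            # ...except a single token

loss = CrossEntropyLoss()(logits, labels)  # default: mean over the 1 active target
loss.backward()
print(logits.grad[0].abs().max())  # roughly 512x larger than if all positions were labeled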

What solved the problem for me is to compute the sum of the losses and divide by the maximum number of labels:

loss_fct = CrossEntropyLoss(reduction="sum")
if attention_mask is not None:
    active_loss = attention_mask.view(-1) == 1
    active_logits = logits.view(-1, self.num_labels)[active_loss]
    active_labels = labels.view(-1)[active_loss]
    class_loss = loss_fct(active_logits, active_labels)
else:
    class_loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

# normalize by the total number of label positions, not the number of active ones
class_loss = class_loss / len(labels.view(-1))

This way the gradients are independent of the number of labels that are -100.
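To sanity-check the argument (again just a sketch): with the sum-then-divide-by-the-total normalization, the gradient at an active position has the same scale whether one or all 512 positions carry a label:

import torch
from torch.nn import CrossEntropyLoss

def normalized_loss(logits, labels, num_labels):
    loss_fct = CrossEntropyLoss(reduction="sum")
    class_loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))
    # divide by the total number of positions, not the number of active ones
    return class_loss / len(labels.view(-1))

torch.manual_seed(0)
seq_len, num_labels = 512, 7
labels_one = torch.full((seq_len,), -100, dtype=torch.long)
labels_one[0] = 3                                        # a single labeled token
labels_all = torch.randint(0, num_labels, (seq_len,))    # every token labeled

for labels in (labels_one, labels_all):
    logits = torch.randn(seq_len, num_labels, requires_grad=True)
    normalized_loss(logits, labels, num_labels).backward()
    print(logits.grad.abs().max())  # comparable magnitude in both cases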

I hope this helps others 🙂 If my assumptions are wrong, I would love some input 👍

sathwikacharya commented 2 years ago

Hey, I am having a similar issue where the logits output by the model after training are NaN. Any idea why this is happening? I am training on a custom dataset with 29 classes and 40,000 data points. The steps I followed are identical to this notebook, except for a few tweaks: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb

I do not know if this matters, but I am training on multiple GPUs using the Accelerate API from Hugging Face. Any help is much appreciated.

magataro commented 2 years ago

Perhaps there is no problem with the loss-calculation code. In my case, I got NaN values only when computing the loss under autocast(); once I stopped using AMP, I no longer got NaN values. I hope this is helpful to you.

NaN with AMP is a known issue. https://github.com/pytorch/pytorch/issues/40497
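If you want to rule AMP in or out, the simplest things to try (sketches, assuming a standard Trainer / PyTorch AMP setup) are dropping --fp16 from the command, or keeping AMP but forcing this model's forward pass to full precision:

import torch

# model and batch are placeholders for your own objects
with torch.cuda.amp.autocast(enabled=False):
    outputs = model(**batch)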
