victor-ab opened this issue 3 years ago
@victor-ab I was trying to use a dataset with a sample annotation like https://github.com/doc-analysis/DocBank/blob/master/DocBank_samples/DocBank_samples/10.tar_1701.04170.gz_TPNL_afterglow_evo_8.txt. Are there any pointers on how to convert that dataset to the format the run_funsd.py script expects?
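Not an official answer, but in case it helps: below is a rough sketch of reading one DocBank .txt annotation into the words / boxes / labels lists that a FUNSD-style token-classification script expects. The helper name is mine, and I am assuming the tab-separated columns are token, x0, y0, x1, y1, R, G, B, font name, label, with coordinates already in the 0-1000 range LayoutLM uses; please verify both assumptions against your own files.

    def read_docbank_example(path):
        # Sketch only: assumed column order per tab-separated line is
        #   token  x0  y0  x1  y1  R  G  B  fontname  label
        words, boxes, labels = [], [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) < 10:
                    continue  # skip malformed lines
                token = parts[0]
                x0, y0, x1, y1 = (int(v) for v in parts[1:5])
                words.append(token)
                boxes.append([x0, y0, x1, y1])  # assumed already 0-1000 normalized
                labels.append(parts[9])
        return {"words": words, "bboxes": boxes, "ner_tags": labels}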
@victor-ab Have you figured out where the problem is? I also used the official example script run_funsd.py with my own custom dataset and got NaN training loss. The run_funsd.py script runs perfectly well on the FUNSD dataset.
Same question here; maybe something is wrong with the spatial-aware self-attention.
Hey, where is the option to provide a custom dataset path in the run_funsd.py file?
@victor-ab @xianshu1 @XueAdas I had a similar issue. I tried a lower learning rate (0.00001) and it works for me now, but training takes quite a long time to get the loss down to the level I want. I guess that is the price to pay.
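For anyone wondering where that learning rate goes: if you train through the Hugging Face Trainer (which, as far as I can tell, run_funsd.py does), it is just a TrainingArguments field. A minimal sketch, with output_dir and epoch count as placeholders:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="layoutlmv2-custom",
        learning_rate=1e-5,   # the lower learning rate that avoided NaN for some of us
        max_grad_norm=1.0,    # Trainer's default gradient clipping, kept explicit
        num_train_epochs=10,
    )

Passing --learning_rate 1e-5 on the command line should be equivalent, since the script parses TrainingArguments.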
I am seeing this behavior too, and after fighting it the whole day I wanted to share my progress. For me it originates in the sequence output of LayoutLMv2 containing NaN values when doing something like the following:
    outputs = self.layoutlmv2(
        input_ids=input_ids,
        bbox=bbox,
        image=image,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    seq_length = input_ids.size(1)
    sequence_output, image_output = (
        outputs[0][:, :seq_length],
        outputs[0][:, seq_length:],
    )
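If you want to confirm it really is the backbone output, and narrow down which submodule first produces the NaN, a plain torch.isnan check plus forward hooks is enough. This is just a debugging sketch built on the standard PyTorch hook API (install_nan_hooks is my own helper name), not anything LayoutLMv2-specific:

    import torch

    # Quick check right after the backbone call:
    if torch.isnan(sequence_output).any():
        raise RuntimeError("NaN in sequence_output")

    # Hook every submodule to find the first layer that emits NaN.
    def install_nan_hooks(model):
        def make_hook(name):
            def hook(module, inputs, output):
                outs = output if isinstance(output, (tuple, list)) else (output,)
                for t in outs:
                    if torch.is_tensor(t) and torch.isnan(t).any():
                        raise RuntimeError(f"NaN first produced by module: {name}")
            return hook

        for name, module in model.named_modules():
            module.register_forward_hook(make_hook(name))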
The loss being NaN seems to be a secondary symptom.
It still seems to be random, but I am trying to narrow it down right now.
So far I excluded the following root causes:
- NaN
- 0
- NaN
As mentioned by @bkwapong, lowering the learning rate below 5e-5 seems to resolve this issue. I hope this proves useful to some of you, and I will keep you posted if I find an actual solution.
Okay, continuing my debugging process here: in LayoutLMv2ForTokenClassification the loss is calculated in the following manner:
    loss = None
    if labels is not None:
        loss_fct = CrossEntropyLoss()
        if attention_mask is not None:
            active_loss = attention_mask.view(-1) == 1
            active_logits = logits.view(-1, self.num_labels)[active_loss]
            active_labels = labels.view(-1)[active_loss]
            loss = loss_fct(active_logits, active_labels)
        else:
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
The problem with this is that active_labels may contain a lot of -100 values. This is the case if one sets only_label_first_subword in the tokenizer to True. CrossEntropyLoss ignores all positions whose label is -100 and averages the loss over the remaining positions only. Take the extreme case where only one of the 512 labels is not -100: the average is then just that single value divided by one, which means the gradient is also divided by one. The gradient flowing through that one active output is therefore roughly 512 times larger than expected. If, on the other hand, all labels are different from -100, the average is the sum of all losses divided by 512, so the gradient is divided by 512 as well.
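To make the averaging behaviour concrete, here is a tiny self-contained check (plain PyTorch, nothing LayoutLMv2-specific) of how CrossEntropyLoss with its default ignore_index=-100 reduces only over the non-ignored positions:

    import torch
    from torch.nn import CrossEntropyLoss

    logits = torch.randn(512, 5)                          # 512 token positions, 5 classes
    labels = torch.full((512,), -100, dtype=torch.long)   # everything ignored ...
    labels[0] = 2                                         # ... except a single real label

    mean_loss = CrossEntropyLoss()(logits, labels)                # averaged over 1 position
    sum_loss = CrossEntropyLoss(reduction="sum")(logits, labels)  # identical here
    print(mean_loss.item(), sum_loss.item(), (sum_loss / labels.numel()).item())

With a single valid label the mean equals the sum, while dividing the sum by the full sequence length of 512 scales it down the way described below.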
What solved the problem for me is to compute the sum of the losses and divide it by the maximum number of labels:
    loss_fct = CrossEntropyLoss(reduction="sum")
    if attention_mask is not None:
        active_loss = attention_mask.view(-1) == 1
        active_logits = logits.view(-1, self.num_labels)[active_loss]
        active_labels = labels.view(-1)[active_loss]
        class_loss = loss_fct(active_logits, active_labels)
    else:
        class_loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    # divide by the total number of label positions, not just the non-ignored ones
    class_loss = class_loss / len(labels.view(-1))
This way the gradients are independent of the number of labels that are -100.
I hope this helps others. If my assumptions are wrong, I would love some input.
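If you would rather not patch the library in place, the same idea can be applied by subclassing and recomputing the loss from the returned logits. This is only a sketch, not the official API: the class name is mine, and it assumes padding positions already carry the -100 label, so the attention_mask filtering from the original code is not repeated.

    import torch
    from torch.nn import CrossEntropyLoss
    from transformers import LayoutLMv2ForTokenClassification
    from transformers.modeling_outputs import TokenClassifierOutput

    class LayoutLMv2WithScaledLoss(LayoutLMv2ForTokenClassification):
        def forward(self, labels=None, **kwargs):
            kwargs["return_dict"] = True
            outputs = super().forward(**kwargs)  # no labels -> library computes no loss
            loss = None
            if labels is not None:
                # sum over non-ignored positions, then divide by all label positions
                loss_fct = CrossEntropyLoss(reduction="sum")
                loss = loss_fct(outputs.logits.view(-1, self.num_labels), labels.view(-1))
                loss = loss / labels.numel()
            return TokenClassifierOutput(
                loss=loss,
                logits=outputs.logits,
                hidden_states=outputs.hidden_states,
                attentions=outputs.attentions,
            )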
Hey, I am having a similar issue where the logits output by the model after training are NaN. Any idea why this is happening? I am training on a custom dataset with 29 classes and 40,000 data points. The steps I follow are identical to this notebook except for a few tweaks: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb
I do not know if it matters, but I am training on multiple GPUs using the Accelerate API of Hugging Face. Any help is much appreciated.
Perhaps there is no problem with the loss calculation code. In my case, I got NaN values only when calculating the loss under autocast(); when I stopped using AMP, I no longer got NaN values. I hope this is helpful to you.
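For reference, two ways to take mixed precision out of the picture; model and batch below are placeholders for your own objects, and the autocast call is the standard torch.cuda.amp API:

    import torch

    # 1) With the HF Trainer: set fp16=False in TrainingArguments (or drop --fp16).
    #
    # 2) In a custom loop, keep AMP globally but run the suspicious forward pass
    #    in full precision:
    with torch.cuda.amp.autocast(enabled=False):
        outputs = model(**batch)   # this forward pass runs in fp32
        loss = outputs.loss
    loss.backward()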
NaN with AMP is a known issue. https://github.com/pytorch/pytorch/issues/40497
Describe the bug: The model I am using is LayoutLMv2 with a custom dataset. The problem arises when using it on my own data.
To reproduce (steps to reproduce the behavior):
Fortunately, I recorded everything with wandb.
After 8 epochs the training and eval loss went to NaN, while the F1 score dropped suddenly. The samples per second increased significantly as well.
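Since the run was already tracked, one cheap safeguard is a small Trainer callback that stops training as soon as a logged loss stops being finite, so the checkpoints from before the collapse are kept. A sketch using the standard TrainerCallback API (the class name is mine):

    import math
    from transformers import TrainerCallback

    class StopOnNaNCallback(TrainerCallback):
        """Stop training as soon as a logged loss becomes NaN or inf."""

        def on_log(self, args, state, control, logs=None, **kwargs):
            loss = (logs or {}).get("loss")
            if loss is not None and not math.isfinite(loss):
                control.should_training_stop = True
            return control

    # usage: Trainer(..., callbacks=[StopOnNaNCallback()])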