Closed harunkuf closed 3 years ago
Hi, it's odd that you still hit an OOM issue even with a batch size of 1. The CRF layer's memory consumption shouldn't depend on the sequence length, so it should be pretty efficient. If you replace the CRF layer with just a softmax, do you have the same problem? Have you tried it with very short inputs?
### I also encountered this problem. How can I solve it?
_model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
File "/home/baiyang/ner/biobert-pytorch-master/named-entity-recognition/transformers/trainer.py", line 499, in train
tr_loss += self._training_step(model, inputs, optimizer)
File "/home/baiyang/ner/biobert-pytorch-master/named-entity-recognition/transformers/trainer.py", line 622, in _training_step
outputs = model(**inputs)
File "/home/baiyang/anaconda3/envs/th/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/baiyang/ner/biobert-pytorch-master/named-entity-recognition/transformers/modeling_bert.py", line 1445, in forward
log_likelihood, outputs = self.crf(logits, labels), self.crf.decode(logits)
File "/home/baiyang/anaconda3/envs/th/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/baiyang/anaconda3/envs/th/lib/python3.7/site-packages/torchcrf/__init__.py", line 102, in forward
numerator = self._compute_score(emissions, tags, mask)
File "/home/baiyang/anaconda3/envs/th/lib/python3.7/site-packages/torchcrf/__init__.py", line 186, in _compute_score
score = self.start_transitions[tags[0]]
RuntimeError: CUDA error: device-side assert triggered
It seems the issue can have various causes: https://discuss.pytorch.org/t/runtimeerror-cuda-error-device-side-assert-triggered/34213. Looking at your stack trace @byew, what triggers the error is `score = self.start_transitions[tags[0]]`. Can you check whether you have valid values for `tags[0]`? I.e., the tag IDs should be in `[0, num_tags - 1]`.
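A minimal sketch of that check, in case it helps. The helper name `find_invalid_tags` and the example labels are hypothetical, and it works on plain Python integers so it can run without a GPU:

```python
from collections import Counter


def find_invalid_tags(tag_ids, num_tags):
    """Count every tag ID that falls outside the valid range [0, num_tags - 1]."""
    return Counter(t for t in tag_ids if not 0 <= t < num_tags)


# Hypothetical example: 9 tags, plus -100 used as an "ignore" label
# and a stray 9 that is one past the last valid ID.
labels = [0, 3, 8, -100, 9, -100]
bad = find_invalid_tags(labels, num_tags=9)
print(dict(bad))
```

For a label tensor, something like `find_invalid_tags(labels.flatten().tolist(), num_tags)` should work. An empty result means the tag IDs themselves are not the cause.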
Getting the same error:
/opt/conda/lib/python3.7/site-packages/torchcrf/__init__.py in _compute_score(self, emissions, tags, mask)
185 # shape: (batch_size,)
186 score = self.start_transitions[tags[0]]
--> 187 score += emissions[0, torch.arange(batch_size), tags[0]]
188
189 for i in range(1, seq_length):
RuntimeError: CUDA error: device-side assert triggered
My guess is that `torch.arange` is creating the tensor on CPU, whereas the rest of the data is on GPU.
Actually, probably not that. I think I have an index in my labels that should be ignored but isn't.
It seems the problem is indeed caused by the invalid label/tag indices. I'm closing this issue.
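For anyone landing here with "ignore" labels in their data, here is a rough sketch of one way to handle them. It assumes the ignored positions are marked with -100 (a common convention, not something confirmed in this thread). The helper name `mask_ignored_labels` is hypothetical, and it uses nested lists rather than tensors to stay self-contained. The idea is to turn the ignored positions into a mask and replace them with any valid tag ID so that indexing never goes out of range:

```python
IGNORE_INDEX = -100  # assumption: the value used to mark positions to skip


def mask_ignored_labels(labels, ignore_index=IGNORE_INDEX, fill_tag=0):
    """Return (safe_labels, mask) for a batch of label sequences.

    mask is True where the label is real and False where it was ignored.
    Ignored positions are overwritten with fill_tag, a valid tag ID, so
    they can be indexed safely; the mask tells the CRF to skip them.
    """
    mask = [[t != ignore_index for t in row] for row in labels]
    safe_labels = [
        [t if keep else fill_tag for t, keep in zip(row, row_mask)]
        for row, row_mask in zip(labels, mask)
    ]
    return safe_labels, mask


# Hypothetical batch of two sequences with ignored positions.
batch = [[2, 5, -100, 1], [0, -100, -100, -100]]
safe_labels, mask = mask_ignored_labels(batch)
```

The resulting labels and mask would then be converted to tensors and passed along the lines of `crf(emissions, tags, mask=mask)`. One caveat: as far as I know, the CRF layer requires the mask to be on at the first timestep of every sequence, so if the first token (e.g. a special token) is ignored, you would need to drop it or give it a real tag first.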
Hi, first of all thanks for this module, it's very helpful!
I'm trying to add this CRF model as an additional layer on top of BERT. However, I keep getting a CUDA error, which I believe is due to RAM limitations. For reference, I have 9 target labels in total, seq_len is 200, and the batch size is 64. Even with a batch size of 1, I still get the "CUDA error: device-side assert triggered" error, which is weird.
I'm running the training on Colab with a Tesla T4 GPU. Do you know how I can fix this problem? Thanks!