Keep getting CUDA side assert triggered error

harunkuf commented 3 years ago

Hi, first of all thanks for this module, it's very helpful!

I'm trying to add this CRF model as an additional layer on top of BERT. However, I keep getting a CUDA error, which I believe is due to RAM limitations. For reference, I have 9 target labels in total, seq_len is 200, and the batch size is 64. Even with a batch size of 1, I still got the "CUDA side assert triggered" error, which is weird.

I'm running the training on Colab with a Tesla T4 GPU. Do you know how I can fix this problem? Thanks!

kmkurn commented 3 years ago

Hi, it's weird that you still have an OOM issue even with a batch size of 1. The CRF layer memory consumption shouldn't depend on the sequence length, so it should be pretty efficient. If you replace the CRF layer with just softmax, do you have the same problem? Have you tried it with very short inputs?

byew commented 3 years ago

### I also encountered this problem. How can I solve it?

    _model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/home/baiyang/ner/biobert-pytorch-master/named-entity-recognition/transformers/trainer.py", line 499, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/home/baiyang/ner/biobert-pytorch-master/named-entity-recognition/transformers/trainer.py", line 622, in _training_step
    outputs = model(**inputs)
  File "/home/baiyang/anaconda3/envs/th/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/baiyang/ner/biobert-pytorch-master/named-entity-recognition/transformers/modeling_bert.py", line 1445, in forward
    log_likelihood, outputs = self.crf(logits, labels), self.crf.decode(logits)
  File "/home/baiyang/anaconda3/envs/th/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/baiyang/anaconda3/envs/th/lib/python3.7/site-packages/torchcrf/__init__.py", line 102, in forward
    numerator = self._compute_score(emissions, tags, mask)
  File "/home/baiyang/anaconda3/envs/th/lib/python3.7/site-packages/torchcrf/__init__.py", line 186, in _compute_score
    score = self.start_transitions[tags[0]]
RuntimeError: CUDA error: device-side assert triggered

kmkurn commented 3 years ago

It seems the issue can have various causes: https://discuss.pytorch.org/t/runtimeerror-cuda-error-device-side-assert-triggered/34213. Looking at your stack trace @byew, what triggers the error is score = self.start_transitions[tags[0]]. Can you check whether you have valid values for tags[0]? That is, the tag IDs should be in [0, num_tags - 1].
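
The range check suggested above can be run on the CPU before the batch ever reaches the GPU. The following is only a minimal sketch: `find_invalid_tags` is a hypothetical helper, not part of pytorch-crf, and it assumes the labels are available as nested lists of integers (for a tensor, something like `labels.cpu().tolist()`).

```python
def find_invalid_tags(tag_rows, num_tags):
    """Return (row, position, value) for every tag outside [0, num_tags - 1].

    Hypothetical helper for illustration; it is not part of pytorch-crf.
    """
    bad = []
    for r, row in enumerate(tag_rows):
        for c, tag in enumerate(row):
            if not 0 <= tag < num_tags:
                bad.append((r, c, tag))
    return bad


# Made-up batch for illustration: with 9 labels, the valid IDs are 0..8.
batch = [[0, 3, 8], [1, 9, -100]]
print(find_invalid_tags(batch, num_tags=9))
```

Any entry in the returned list is a tag that would index past the end of the transition parameters, which is one common way to hit a device-side assert.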

gautierdag commented 3 years ago

Getting the same error:

/opt/conda/lib/python3.7/site-packages/torchcrf/__init__.py in _compute_score(self, emissions, tags, mask)
    185         # shape: (batch_size,)
    186         score = self.start_transitions[tags[0]]
--> 187         score += emissions[0, torch.arange(batch_size), tags[0]]
    188 
    189         for i in range(1, seq_length):

RuntimeError: CUDA error: device-side assert triggered

~~My guess is that torch.arange is creating the tensor on CPU, whereas the rest of the data is on GPU.~~

Actually, it's probably not that. I think my labels might contain an index that should be ignored but isn't.
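
One possible way to handle labels that should be ignored is to replace them with a valid tag ID and switch them off through the mask instead. This is a sketch under assumptions: the thread does not say which value marks ignored positions, so `-100` (the default `ignore_index` of `torch.nn.CrossEntropyLoss`) is assumed here, and `tags_and_mask` is a hypothetical helper working on plain nested lists.

```python
def tags_and_mask(label_rows, ignore_index=-100, fill_tag=0):
    """Split raw labels into CRF-safe tags plus a mask.

    Assumption: ignored positions are marked with `ignore_index`.
    Hypothetical helper for illustration; it is not part of pytorch-crf.
    """
    tags, mask = [], []
    for row in label_rows:
        # Ignored positions get a valid placeholder tag so indexing stays in range.
        tags.append([fill_tag if t == ignore_index else t for t in row])
        # The mask records which positions carry a real label.
        mask.append([t != ignore_index for t in row])
    return tags, mask


# Made-up labels for illustration.
tags, mask = tags_and_mask([[2, 5, -100], [1, -100, -100]])
print(tags)
print(mask)
```

The two lists can then be converted to tensors and passed as the `tags` and `mask` arguments of the CRF layer. As far as I recall, pytorch-crf also expects the first timestep of every sequence to be unmasked, so a leading ignored position (such as a special token) may need separate handling.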

kmkurn commented 3 years ago

It seems the problem is indeed caused by the invalid label/tag indices. I'm closing this issue.

kmkurn / pytorch-crf

Keep getting CUDA side assert triggered error #75