taishan1994 / BERT-BILSTM-CRF

Chinese named entity recognition using BERT-BILSTM-CRF.

Error after training for a few epochs: RuntimeError: d.is_cuda() INTERNAL ASSERT FAILED #39

Open uncle-tou opened 4 months ago

uncle-tou commented 4 months ago

With device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") set, training on either of the bundled datasets, duie or dgre, throws the following error after a few epochs:

【train】6/100 40420/713100 loss:2.2920708656311035
【train】6/100 40430/713100 loss:0.8514504432678223
【train】6/100 40440/713100 loss:1.3389232158660889
Traceback (most recent call last):
  File "/home/qhm/Program/TaoYuan/BERT-BILSTM-CRF-ty/BERT-BILSTM-CRF-main/main.py", line 229, in <module>
    main(data_name)
  File "/home/qhm/Program/TaoYuan/BERT-BILSTM-CRF-ty/BERT-BILSTM-CRF-main/main.py", line 220, in main
    train.train()
  File "/home/qhm/Program/TaoYuan/BERT-BILSTM-CRF-ty/BERT-BILSTM-CRF-main/main.py", line 54, in train
    output = self.model(input_ids, attention_mask, labels)
  File "/home/qhm/anaconda3/envs/TY_taishan/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/qhm/Program/TaoYuan/BERT-BILSTM-CRF-ty/BERT-BILSTM-CRF-main/model.py", line 34, in forward
    logits = self.crf.decode(seq_out, mask=attention_mask.bool())
  File "/home/qhm/anaconda3/envs/TY_taishan/lib/python3.9/site-packages/torchcrf/__init__.py", line 139, in decode
    return self._viterbi_decode(emissions, mask)
  File "/home/qhm/anaconda3/envs/TY_taishan/lib/python3.9/site-packages/torchcrf/__init__.py", line 305, in _viterbi_decode
    score = torch.where(mask[i].unsqueeze(1), next_score, score)
RuntimeError: d.is_cuda() INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1659484809662/work/c10/cuda/impl/CUDAGuardImpl.h":30, please report a bug to PyTorch.

uncle-tou commented 4 months ago

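It was a hardware problem. After switching to a different machine, the error no longer appeared. The original machine keeps dropping its GPU driver, and training other models on it in a different torch environment produces the same error.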
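For anyone trying to tell a dropped driver apart from a bug in the training code, one option is to check whether the driver still answers once the error appears. The following is only a minimal sketch, assuming an NVIDIA GPU with `nvidia-smi` on the PATH; the function name `gpu_driver_responding` and its parameters are hypothetical and not part of this repository.

```python
import shutil
import subprocess


def gpu_driver_responding(command: str = "nvidia-smi", timeout_s: float = 10.0) -> bool:
    """Return True if `<command> -L` (list GPUs) exits cleanly, False otherwise.

    Hypothetical helper for illustration: a False result after training has
    crashed suggests the driver or the GPU dropped out, not the model code.
    """
    exe = shutil.which(command)
    if exe is None:
        # Tool not installed or not on PATH; treat as "not responding".
        return False
    try:
        result = subprocess.run(
            [exe, "-L"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable nvidia-smi is itself a sign of driver trouble.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("driver responding:", gpu_driver_responding())
```

Running this right after the crash, and again after a reboot, gives a rough indication of whether the machine lost the GPU; it does not identify the underlying hardware cause.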

hhu-aiboy commented 4 months ago

Man! What can I say!

Vcan12600 commented 2 months ago

A few days ago I kept hitting this problem too. At the time my CPU was an i9-13900K; after I swapped in a new CPU, an i7-14700K, the problem no longer occurs.

hhu-aiboy commented 1 month ago

> A few days ago I kept hitting this problem too. At the time my CPU was an i9-13900K; after I swapped in a new CPU, an i7-14700K, the problem no longer occurs.

Hard to keep a straight face: this program was indeed running on a 13900K.