vid-koci / bert-commonsense

Code for papers "A Surprisingly Robust Trick for Winograd Schema Challenge" and "WikiCREM: A Large Unsupervised Corpus for Coreference Resolution"

for those with Caught StopIteration in replica error #6

Open xiaoouwang opened 3 years ago

xiaoouwang commented 3 years ago

The author's code is based on torch 0.4.1. However, many people have newer GPUs that are not supported by CUDA < 11, so they have to use a more recent version such as torch 1.8.

If you use CUDA < 11 on such a GPU, you will run into the following error:

RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/THCBlas.cu:411

If you use a compatible CUDA version, the StopIteration error appears instead when you run on multiple GPUs. I believe this issue has existed since torch 1.5; see https://github.com/huggingface/transformers/issues/3936

To work around the bug by hand, change the following line in pytorch_pretrained_bert/modeling.py

extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility

to

extended_attention_mask = extended_attention_mask.to(dtype=torch.float32) # fp16 compatibility

I won't make a pull request because I don't know how this change affects torch < 1.5.