This looks like a PyTorch/CUDA error, possibly caused by running out of GPU memory. See this thread for an example: https://discuss.pytorch.org/t/cudnn-status-execution-failed/4441/12
I'm going to close this issue. Please reopen it if the problem turns out to be in the training code rather than in PyTorch/CUDA.
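One way to test the out-of-memory hypothesis is to run evaluation on smaller chunks and see whether the error disappears. Below is a minimal sketch of that idea, not code from the training script. The names `chunked`, `embed_in_chunks`, and `embed_fn` are hypothetical, and it assumes the model call can be wrapped in a function that takes a list of sequences and returns one result per sequence.

```python
def chunked(items, max_size):
    """Yield consecutive slices of `items` with at most `max_size` elements."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    for start in range(0, len(items), max_size):
        yield items[start:start + max_size]


def embed_in_chunks(embed_fn, sequences, max_size):
    """Call `embed_fn` on small chunks and collect the per-sequence results.

    `embed_fn` is a hypothetical stand-in for the model call, for example a
    wrapper around `model(x)`. It must return one result per input sequence.
    """
    outputs = []
    for chunk in chunked(sequences, max_size):
        outputs.extend(embed_fn(chunk))
    return outputs


if __name__ == "__main__":
    # Toy stand-in for the model: "embeds" each sequence as its length.
    def fake_embed(batch):
        return [len(seq) for seq in batch]

    print(embed_in_chunks(fake_embed, ["MKV", "AG", "LLLLP"], max_size=2))
```

If the error goes away with a smaller `max_size`, memory pressure is a likely culprit. Running the forward pass under `torch.no_grad()` during evaluation (available in PyTorch 0.4 and later) also lowers memory use, since no activations are kept for backpropagation.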
When I run train_similarity_and_contact.py, it fails at the 5th or 6th epoch with the following error:
File "train_similarity_and_contact.py", line 585, in <module>
main()
File "train_similarity_and_contact.py", line 563, in main
eval_contacts(model, cmap_test_iterator, use_cuda)
File "train_similarity_and_contact.py", line 189, in eval_contacts
logits_this, y_this = predict_contacts(model, x, y_mb, use_cuda)
File "train_similarity_and_contact.py", line 161, in predict_contacts
z = model(x) # embed the sequences
File "/root/anaconda3/envs/PSE_SI/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/jwang/protein-sequence-embedding-iclr2019/src/models/multitask.py", line 26, in forward
return self.embedding(x)
File "/root/anaconda3/envs/PSE_SI/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/jwang/protein-sequence-embedding-iclr2019/src/models/embedding.py", line 129, in forward
h,_ = self.rnn(h)
File "/root/anaconda3/envs/PSE_SI/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/PSE_SI/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 192, in forward
output, hidden = func(input, self.all_weights, hx, batch_sizes)
File "/root/anaconda3/envs/PSE_SI/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 323, in forward
return func(input, *fargs, **fkwargs)
File "/root/anaconda3/envs/PSE_SI/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 287, in forward
dropout_ts)
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED