tbepler / protein-sequence-embedding-iclr2019

Source code for "Learning protein sequence embeddings using information from structure" - ICLR 2019

RuntimeError: CUDNN_STATUS_EXECUTION_FAILED #6

Closed asawanggaa closed 5 years ago

asawanggaa commented 5 years ago

The error occurs when running train_similarity_and_contact.py, at the 5th or 6th epoch:

  File "train_similarity_and_contact.py", line 585, in <module>
    main()
  File "train_similarity_and_contact.py", line 563, in main
    eval_contacts(model, cmap_test_iterator, use_cuda)
  File "train_similarity_and_contact.py", line 189, in eval_contacts
    logits_this, y_this = predict_contacts(model, x, y_mb, use_cuda)
  File "train_similarity_and_contact.py", line 161, in predict_contacts
    z = model(x) # embed the sequences
  File "/root/anaconda3/envs/PSE_SI/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/jwang/protein-sequence-embedding-iclr2019/src/models/multitask.py", line 26, in forward
    return self.embedding(x)
  File "/root/anaconda3/envs/PSE_SI/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/jwang/protein-sequence-embedding-iclr2019/src/models/embedding.py", line 129, in forward
    h,_ = self.rnn(h)
  File "/root/anaconda3/envs/PSE_SI/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/PSE_SI/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 192, in forward
    output, hidden = func(input, self.all_weights, hx, batch_sizes)
  File "/root/anaconda3/envs/PSE_SI/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 323, in forward
    return func(input, *fargs, **fkwargs)
  File "/root/anaconda3/envs/PSE_SI/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 287, in forward
    dropout_ts)
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED

tbepler commented 5 years ago

This is some sort of PyTorch/CUDA error, possibly caused by running out of GPU RAM. See this thread for an example: https://discuss.pytorch.org/t/cudnn-status-execution-failed/4441/12
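
If GPU memory is indeed the cause, one way to test that hypothesis is to cap how much padded input each evaluation mini-batch holds, so that a few long sequences cannot inflate a batch. The sketch below is only an illustration: `batches_by_padded_size` and `max_padded_tokens` are hypothetical names that do not exist in this repository, and the budget (batch size times longest sequence) is a rough proxy for memory use, not a measurement.

```python
def batches_by_padded_size(lengths, max_padded_tokens):
    """Group sequence indices into mini-batches with a bounded padded size.

    Hypothetical helper, not part of the repository. A batch's padded size
    is taken to be (number of sequences) * (longest sequence in the batch).
    """
    # Sort indices by length so that similar-length sequences share a batch.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current, longest = [], [], 0
    for i in order:
        new_longest = max(longest, lengths[i])
        # Start a new batch if adding this sequence would exceed the budget.
        if current and (len(current) + 1) * new_longest > max_padded_tokens:
            batches.append(current)
            current, new_longest = [], lengths[i]
        current.append(i)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


# Example with made-up sequence lengths and a budget of 1000 padded positions.
example_lengths = [100, 900, 120, 80, 400]
example_batches = batches_by_padded_size(example_lengths, max_padded_tokens=1000)
# example_batches == [[3, 0, 2], [4], [1]]
```

A sequence longer than the budget still gets a batch of its own rather than being dropped. Each index list could then be embedded separately. If the crash disappears with a smaller budget, that would support the out-of-memory explanation; if it persists, the cause probably lies elsewhere.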

tbepler commented 5 years ago

I'm going to close this issue. Please reopen if it turns out to be a problem related to the training code and not pytorch/CUDA.