juexinwang / NRI-MD

Neural relational inference for molecular dynamics simulations
MIT License

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED #5

Open hxzwqw opened 2 years ago

hxzwqw commented 2 years ago

COMMAND:

```
python main.py --encoder cnn --decoder rnn --encoder-dropout 0.05 --decoder-dropout 0.2
```

OUTPUT:

```
Namespace(batch_size=1, cuda=True, decoder='rnn', decoder_dropout=0.2, decoder_hidden=256, dims=6, dynamic_graph=True, edge_types=4, encoder='cnn', encoder_dropout=0.05, encoder_hidden=256, epochs=500, factor=True, gamma=0.5, hard=True, load_folder='', lr=0.0005, lr_decay=200, no_cuda=False, no_factor=False, num_residues=77, number_exp=56, number_expstart=0, prediction_steps=1, prior=True, save_folder='logs', seed=42, skip_first=True, temp=0.5, timesteps=50, var=5e-05)
Testing with dynamically re-computed graph.
Using factor graph CNN encoder.
Using learned recurrent interaction net decoder.
Using prior
[0.91 0.03 0.03 0.03]
Start Training...
Traceback (most recent call last):
  File "main.py", line 419, in <module>
    epoch, best_val_loss)
  File "main.py", line 209, in train
    logits = encoder(data, rel_rec, rel_send)
  File "/home/hxz/anaconda3/envs/netw_2.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/hxz/Mdel/NRI-MD/NRI-MD/modules.py", line 213, in forward
    edges = self.node2edge_temporal(inputs, rel_rec, rel_send)
  File "/media/hxz/Mdel/NRI-MD/NRI-MD/modules.py", line 182, in node2edge_temporal
    receivers = torch.matmul(rel_rec, x)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
```

Description: I have tried the default PDB file and even the default settings (without the CNN encoder or RNN decoder), and the error always occurs. I have also tried to follow the error report and debug it, but the trail seems endless. Thanks in advance if someone, or the author, can give me some advice.
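One way to check whether the failure comes from NRI-MD or from the CUDA/cuBLAS installation is to run a standalone matmul on the GPU, outside the repository. This is only a minimal sketch; the shapes are illustrative (roughly matching `num_residues=77`, `timesteps=50`, `dims=6` from the Namespace above), and it simply exercises the same `cublasSgemm` path that fails in `node2edge_temporal`:

```python
import torch

# Standalone check: does a plain float32 cuBLAS matmul succeed on this machine?
# Shapes are illustrative only; node2edge_temporal multiplies rel_rec by the
# flattened node features, so any GPU matmul exercises the same code path.
a = torch.randn(5852, 77, device='cuda')   # e.g. (num_edges, num_residues)
b = torch.randn(77, 300, device='cuda')    # e.g. (num_residues, timesteps * dims)
c = torch.matmul(a, b)
torch.cuda.synchronize()                   # force the kernel to execute now
print(c.shape)
```

If this snippet also raises CUBLAS_STATUS_EXECUTION_FAILED, the problem is the CUDA/driver/PyTorch combination rather than the model code.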

DivingWhale commented 1 year ago

I ran into the same problem. Did you solve it?

UR-Free commented 1 month ago

It seems to be a problem with CUDA. I solved it by turning off CUDA in torch and letting the computation run on the CPU.
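A minimal sketch of that CPU fallback, assuming main.py derives `args.cuda` from the `--no-cuda` flag that appears in the Namespace dump above (the argument-parsing code here is illustrative, not copied from the repository):

```python
import argparse
import torch

# Typical PyTorch CPU-fallback pattern (assumption: main.py already parses a
# --no-cuda flag, as the printed Namespace with no_cuda=False suggests).
parser = argparse.ArgumentParser()
parser.add_argument('--no-cuda', action='store_true', default=False)
args, _ = parser.parse_known_args()

# With --no-cuda set (or args.cuda forced to False), every tensor stays on the
# CPU, so the failing cublasSgemm call is never reached.
args.cuda = not args.no_cuda and torch.cuda.is_available()
device = torch.device('cuda' if args.cuda else 'cpu')
print('running on', device)
```

If the flag is wired through like this, something along the lines of `python main.py --no-cuda --encoder cnn --decoder rnn --encoder-dropout 0.05 --decoder-dropout 0.2` should run entirely on the CPU; it is slower, but it avoids the cuBLAS error.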