CUDA not executing during runtime

Stelath commented 2 years ago

CUDA fails to execute at runtime, even though I have the correct version of PyTorch and an NVIDIA driver installed on the system.

The error below is thrown when running the command: python pretrain_DAMSM.py --cfg cfg/DAMSM/book.yml --gpu 0

         'B_DCGAN': False,
         'CONDITION_DIM': 100,
         'DF_DIM': 64,
         'GF_DIM': 128,
         'R_NUM': 2,
         'Z_DIM': 100},
 'GPU_ID': 0,
 'RNN_TYPE': 'LSTM',
 'TEXT': {'CAPTIONS_PER_IMAGE': 1, 'EMBEDDING_DIM': 256, 'WORDS_NUM': 18},
 'TRAIN': {'BATCH_SIZE': 48,
           'B_NET_D': True,
           'DISCRIMINATOR_LR': 0.0002,
           'ENCODER_LR': 0.002,
           'FLAG': True,
           'GENERATOR_LR': 0.0002,
           'MAX_EPOCH': 600,
           'NET_E': '',
           'NET_G': '',
           'RNN_GRAD_CLIP': 0.25,
           'SMOOTH': {'GAMMA1': 4.0,
                      'GAMMA2': 5.0,
                      'GAMMA3': 10.0,
                      'LAMBDA': 1.0},
           'SNAPSHOT_INTERVAL': 50},
 'TREE': {'BASE_SIZE': 299, 'BRANCH_NUM': 1},
 'WORKERS': 1}
/opt/conda/envs/dm_gan/lib/python3.6/site-packages/torchvision/transforms/transforms.py:220: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
  "please use transforms.Resize instead.")
Load filenames from: ../data/books/train/filenames.pickle (4625)                                                  
Load filenames from: ../data/books/test/filenames.pickle (1622)                                                   
Load from:  ../data/books/captions.pickle
31146 1
Load filenames from: ../data/books/train/filenames.pickle (4625)                                                  
Load filenames from: ../data/books/test/filenames.pickle (1622)                                                   
Load from:  ../data/books/captions.pickle
/opt/conda/envs/dm_gan/lib/python3.6/site-packages/torch/nn/modules/rnn.py:50: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.5 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
Load pretrained model from  https://download.pytorch.org/models/inception_v3_google-1a9a5a14.pth                  
Traceback (most recent call last):
  File "pretrain_DAMSM.py", line 350, in <module>
    dataset.ixtoword, image_dir, criterion)
  File "pretrain_DAMSM.py", line 87, in train
    words_features, sent_code = cnn_model(imgs[-1])
  File "/opt/conda/envs/dm_gan/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__     
    result = self.forward(*input, **kwargs)
  File "/books-nn/T2I_CL/DM-GAN+CL/code/model.py", line 208, in forward                                           
    x = self.Conv2d_1a_3x3(x)
  File "/opt/conda/envs/dm_gan/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__     
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/dm_gan/lib/python3.6/site-packages/torchvision/models/inception.py", line 433, in forward 
    x = self.bn(x)
  File "/opt/conda/envs/dm_gan/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__     
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/dm_gan/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 107, in forward   
    exponential_average_factor, self.eps)
  File "/opt/conda/envs/dm_gan/lib/python3.6/site-packages/torch/nn/functional.py", line 1670, in batch_norm      
    training, momentum, eps, torch.backends.cudnn.enabled                                                         
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

huiyegit commented 2 years ago

It seems the issue is in the forward pass of the CNN model. You may want to check whether the input 'imgs[-1]' is correct. You could also try the source code under the AttnGAN+CL folder, as my experiment with 'pretrain_DAMSM.py' was done using that version.
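
One way to act on the suggestion to check 'imgs[-1]' is to validate its shape just before the call to cnn_model. The helper below is only a sketch: the name check_image_batch_shape is hypothetical, the default of 3 channels assumes RGB input, and the default size of 299 is taken from 'BASE_SIZE' in the config dump above. If the model resizes its input internally, the size check may be stricter than necessary.

```python
def check_image_batch_shape(shape, channels=3, size=299):
    """Return a list of problems found in an (N, C, H, W) shape tuple.

    Hypothetical helper: the defaults are assumptions (RGB input, and
    BASE_SIZE = 299 from the config dump), not values confirmed by the repo.
    """
    if len(shape) != 4:
        return ["expected 4 dimensions (N, C, H, W), got %d" % len(shape)]

    problems = []
    n, c, h, w = shape
    if n < 1:
        problems.append("batch is empty")
    if c != channels:
        problems.append("expected %d channels, got %d" % (channels, c))
    if (h, w) != (size, size):
        problems.append("expected %dx%d images, got %dx%d" % (size, size, h, w))
    return problems


# Usage inside the training loop, before cnn_model(imgs[-1]):
#     problems = check_image_batch_shape(tuple(imgs[-1].shape))
#     if problems:
#         print("Suspicious input batch:", problems)
```

An empty list means the shape looks as assumed; it says nothing about the values in the tensor, so a separate check for NaN or infinite entries may still be worthwhile.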

Stelath commented 2 years ago

I fixed this. Eventually I installed the correct version of the CUDA toolkit on my machine, along with a supported NVIDIA driver; I'm still not entirely sure what the problem was, though.
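
Since the fix came down to matching the CUDA toolkit and driver, a quick environment report can help narrow down this kind of mismatch. The following is a minimal sketch, not a definitive diagnostic: the function name cuda_report and the dictionary keys are made up for illustration, and the small BatchNorm2d forward pass is only an assumption about what might reproduce the failing cuDNN call from the traceback.

```python
def cuda_report():
    """Collect version info and run a tiny batch-norm forward pass on the GPU.

    Hypothetical helper: the name and the keys of the returned dict are
    illustrative only.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}

    report = {
        "torch": torch.__version__,
        # CUDA version this PyTorch build was compiled against (None for CPU builds).
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["cudnn"] = torch.backends.cudnn.version()
        report["device"] = torch.cuda.get_device_name(0)
        report["capability"] = torch.cuda.get_device_capability(0)
        try:
            # Same kind of op that failed in the traceback (batch norm on the GPU).
            bn = torch.nn.BatchNorm2d(3).cuda()
            bn(torch.randn(2, 3, 8, 8, device="cuda"))
            torch.cuda.synchronize()
            report["batch_norm_smoke_test"] = "ok"
        except RuntimeError as exc:
            report["batch_norm_smoke_test"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(key, "=", value)
```

Comparing "built_for_cuda" with the CUDA version supported by the installed driver (as shown by nvidia-smi) is one way to spot the kind of mismatch described above.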

huiyegit / T2I_CL

CUDA not executing during runtime #9