training fails with "RuntimeError: cuda runtime error (11) : invalid argument at THCGeneral.cpp:405"

betogulliver commented 3 years ago

RTX 2080 Ti
python                    3.7.7                hcff3b4d_5  
cuda100                   1.0                           0    pytorch
pytorch                   0.4.1           py37_py36_py35_py27__9.0.176_7.1.2_2    pytorch
torchvision               0.2.1                      py_2    pytorch
CUDA Version 10.2.89
cudnn 7.6.4

I have successfully run:

    sh run_test.sh

but after trying:

    sh run_train_val.sh

I got the following error (full log below):

    RuntimeError: cuda runtime error (11) : invalid argument at THCGeneral.cpp:405

I have tried the tips below, but the same error remains.

    RuntimeError: cuda runtime error (11) : invalid argument at THCGeneral.cpp:405 #1566
    https://github.com/fastai/fastai/issues/1566
    conda install -c pytorch cuda100

    THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=383 error=11 : invalid argument #21154
    https://github.com/pytorch/pytorch/issues/21154
    Didn't work for me. RTX2080, Cuda 10, Pytorch 1.3. :(

Any ideas?

Thanks again for all your help.

(structure_knowledge_distillation) user@voyager% sh run_train_val.sh
INFO     D_att_ckpt_path : ./ckpt/save_path/Att_discriminator
INFO     D_ckpt_path : ./ckpt/save_path/Distriminator
INFO     D_resume : True
INFO     S_ckpt_path : ./ckpt/save_path/Student
INFO     S_resume : True
INFO     T_ckpt_path : ./ckpt/Teacher/CS_scenes_38413_0.7832174615268139.pth
INFO     adv_conv_dim : 64
INFO     adv_loss_type : wgan-gp
INFO     batch_size : 8
INFO     best_mean_IU : 0.0
INFO     classes_num : 19
INFO     data_dir : ./data/cityscapes
INFO     data_list : ./dataset/list/cityscapes/train.lst
INFO     data_set : cityscape
INFO     device : cuda
INFO     epoch_nums : 1
INFO     gpu : 0
INFO     gpu_num : 1
INFO     ho : True
INFO     ignore_label : 255
INFO     imsize_for_adv : 65
INFO     input_size : 512,512
INFO     is_student_load_imgnet : True
INFO     is_training : False
INFO     lambda_d : 0.1
INFO     lambda_gp : 10.0
INFO     lambda_pa : 0.5
INFO     lambda_pi : 10.0
INFO     last_step : 0
INFO     log_path : ./ckpt/log/save_path
INFO     lr_d : 0.0004
INFO     lr_g : 0.01
INFO     momentum : 0.9
INFO     num_steps : 40000
INFO     pa : True
INFO     parallel : False
INFO     pi : True
INFO     pool_scale : 0.5
INFO     power : 0.9
INFO     preprocess_GAN_mode : 1
INFO     random_mirror : True
INFO     random_scale : True
INFO     recurrence : 1
INFO     save_name : save_path
INFO     snapshot_dir : ./snapshots/
INFO     start_epoch : 0
INFO     student_pretrain_model_imgnet : ./dataset/resnet18-imagenet.pth
INFO     weight_decay : 0.0005
321300 images are loaded!
500 images are loaded!
INFO     ------------
INFO     => load./dataset/resnet18-imagenet.pth
INFO     ------------
INFO     student_model: Number of params: 13.07M
INFO     ------------
INFO     => no teacher ckpt find
INFO     ------------
INFO     teacher_model: Number of params: 70.44M
INFO     ------------
INFO     => checkpoint './ckpt/save_path/Distriminator/model_best.pth.tar' does not exit
INFO     ------------
INFO     D_model: Number of params: 3.20M
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
Traceback (most recent call last):
  File "train_and_eval.py", line 25, in <module>
    model.optimize_parameters()
  File "/home/user/work/projects/structure_knowledge_distillation/networks/kd_model.py", line 168, in optimize_parameters
    self.forward()
  File "/home/user/work/projects/structure_knowledge_distillation/networks/kd_model.py", line 122, in forward
    self.preds_T = self.parallel_teacher.eval()(self.images, parallel=args.parallel)
  File "/home/user/anaconda3/envs/structure_knowledge_distillation/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/work/projects/structure_knowledge_distillation/utils/parallel.py", line 106, in forward
    return super().forward(inputs, **kwargs)
  File "/home/user/anaconda3/envs/structure_knowledge_distillation/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/user/anaconda3/envs/structure_knowledge_distillation/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/work/projects/structure_knowledge_distillation/networks/pspnet_combine.py", line 177, in forward
    x = self.relu1(self.bn1(self.conv1(x)))
  File "/home/user/anaconda3/envs/structure_knowledge_distillation/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/anaconda3/envs/structure_knowledge_distillation/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp:663
(structure_knowledge_distillation) user@voyager%

wl082013 commented 3 years ago

PyTorch 0.4.1 is not compatible with CUDA 10 on the RTX 2080. You need to either upgrade PyTorch or downgrade CUDA to 9.0, although CUDA 9.0 may also fail on the RTX 2080.
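
One way to check for this kind of mismatch is to compare the CUDA version the installed PyTorch binary was built with against the minimum CUDA release that supports the GPU's compute capability. The sketch below is illustrative, not an official tool: the helper names `parse_version` and `is_build_compatible` and the small lookup table are assumptions made for this example, and the table lists only a few architectures. It also ignores PTX JIT forward compatibility, so a `False` result is a hint rather than proof. The RTX 2080 Ti is a Turing card with compute capability 7.5, which needs CUDA 10.0 or newer.

```python
# Minimum CUDA toolkit release that can target each compute capability.
# Deliberately partial table, for illustration only.
MIN_CUDA_FOR_CAPABILITY = {
    (6, 1): (8, 0),   # Pascal, e.g. GTX 1080 Ti
    (7, 0): (9, 0),   # Volta, e.g. V100
    (7, 5): (10, 0),  # Turing, e.g. RTX 2080 Ti
}


def parse_version(text):
    """Turn a string such as '9.0.176' into (9, 0); only major and minor matter here."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def is_build_compatible(build_cuda, capability):
    """Return True if a binary built with `build_cuda` can natively target `capability`."""
    required = MIN_CUDA_FOR_CAPABILITY.get(tuple(capability))
    if required is None:
        raise KeyError("capability %r is not in the illustrative table" % (capability,))
    return parse_version(build_cuda) >= required


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        print("torch", torch.__version__, "built with CUDA", torch.version.cuda)
        print(torch.cuda.get_device_name(0), "capability", cap)
        try:
            print("compatible:", is_build_compatible(torch.version.cuda, cap))
        except KeyError as exc:
            print("unknown:", exc)
```

For example, `is_build_compatible("9.0.176", (7, 5))` is false, while `is_build_compatible("10.0.130", (7, 5))` is true.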

Shawn207 commented 2 years ago

> PyTorch 0.4.1 is not compatible with CUDA 10 on the RTX 2080. You need to either upgrade PyTorch or downgrade CUDA to 9.0, although CUDA 9.0 may also fail on the RTX 2080.

Hi, I just hit exactly the same issue. However, I am using the cuda-9.0-pytorch-0.4.1 Docker image with python=3.5 (following the instructions). Do you have any idea what could cause it?
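
It may help to note that the `conda list` output in the first post shows the pytorch build string `py37_py36_py35_py27__9.0.176_7.1.2_2`. Assuming the two trailing version-like fields are the CUDA and cuDNN versions the binary was built with (an interpretation of the naming scheme, not something stated in this thread), that install was also a CUDA 9.0 build despite the `cuda100` package. A minimal sketch for pulling those fields out of a build string, with the hypothetical helper name `cuda_cudnn_from_build`:

```python
import re


def cuda_cudnn_from_build(build):
    """Extract (cuda, cudnn) version strings from a conda build string.

    Assumes the layout '<python tags>__<cuda>_<cudnn>_<build number>';
    returns None when the string does not follow that layout.
    """
    match = re.search(r"__(\d+(?:\.\d+)+)_(\d+(?:\.\d+)+)", build)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(cuda_cudnn_from_build("py37_py36_py35_py27__9.0.176_7.1.2_2"))
```

If that reading is right, both setups in this thread run a CUDA 9.0 build of PyTorch 0.4.1, which would be consistent with seeing the same error in the Docker image.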

irfanICMLL / structure_knowledge_distillation, issue #43