RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

NeuralBricolage commented 3 years ago

Traceback (most recent call last):
  File "train.py", line 43, in <module>
    model.data_dependent_initialize(data)
  File "/home/helena/CUT/models/cut_model.py", line 108, in data_dependent_initialize
    self.compute_G_loss().backward() # calculate graidents for G
  File "/home/helena/anaconda3/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/helena/anaconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Hello, I'm aware that this issue has already been brought up and that the suggestion was to downgrade to PyTorch 1.4, which I'm trying to avoid since I'm on CUDA 11. What I find interesting, though, is that CycleGAN training works just fine with the same setup (CUDA 11.1, PyTorch 1.8) and on the same dataset. Any suggestions on how to debug this are welcome.
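
A common first debugging step for an asynchronous CUDA error such as "failed to synchronize" is to rerun with the CUDA_LAUNCH_BLOCKING=1 environment variable, so that kernel launches are synchronous and the Python traceback is more likely to point at the operation that actually faulted. The sketch below only builds such an environment; the helper name blocking_env is hypothetical, and the commented-out command simply reuses the train.py arguments quoted later in this thread.

```python
import os


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    blocking_env is a hypothetical helper for illustration, not part of CUT.
    """
    env = dict(os.environ if base is None else base)
    # CUDA_LAUNCH_BLOCKING=1 makes each kernel launch block until it finishes,
    # so an illegal memory access is reported near the call that caused it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


env = blocking_env()

# Assumed usage (not executed here): launch training with the modified environment.
# import subprocess
# subprocess.run(
#     ["python3", "train.py", "--gpu_ids", "0",
#      "--dataroot", "./datasets/grumpifycat",
#      "--name", "grumpifycat_FastCUT", "--CUT_mode", "FastCUT"],
#     env=env,
# )
```

Running this way is slower, so it is meant for locating the failing call rather than for regular training.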

layer19 commented 3 years ago

I got exactly the same problem on PyTorch 1.8 and CUDA 11.1 (trying to run the default FastCUT training from the example). Downgrading to PyTorch 1.4 and CUDA 9.2 doesn't help and leads to:

root@d63a8e8c3efe:/usr/src/CUT# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148

root@d63a8e8c3efe:/usr/src/CUT# CUDA_VISIBLE_DEVICES=0 python3 train.py  --gpu_ids 0 --dataroot ./datasets/grumpifycat --name grumpifycat_FastCUT --CUT_mode FastCUT --verbose --num_threads 0
----------------- Options ---------------
                 CUT_mode: FastCUT                              [default: CUT]
               batch_size: 1                             
                    beta1: 0.5                           
                    beta2: 0.999                         
          checkpoints_dir: ./checkpoints                 
           continue_train: False                         
                crop_size: 256                           
                 dataroot: ./datasets/grumpifycat               [default: placeholder]
             dataset_mode: unaligned                     
                direction: AtoB                          
              display_env: main                          
             display_freq: 400                           
               display_id: None                          
            display_ncols: 4                             
             display_port: 8097                          
           display_server: http://localhost              
          display_winsize: 256                           
               easy_label: experiment_name               
                    epoch: latest                        
              epoch_count: 1                             
          evaluation_freq: 5000                          
        flip_equivariance: True                          
                 gan_mode: lsgan                         
                  gpu_ids: 0                             
                init_gain: 0.02                          
                init_type: xavier                        
                 input_nc: 3                             
                  isTrain: True                                 [default: None]
               lambda_GAN: 1.0                           
               lambda_NCE: 10.0                          
                load_size: 286                           
                       lr: 0.0002                        
           lr_decay_iters: 50                            
                lr_policy: linear                        
         max_dataset_size: inf                           
                    model: cut                           
                 n_epochs: 150                           
           n_epochs_decay: 50                            
               n_layers_D: 3                             
                     name: grumpifycat_FastCUT                  [default: experiment_name]
                    nce_T: 0.07                          
                  nce_idt: False                         
nce_includes_all_negatives_from_minibatch: False                         
               nce_layers: 0,4,8,12,16                   
                      ndf: 64                            
                     netD: basic                         
                     netF: mlp_sample                    
                  netF_nc: 256                           
                     netG: resnet_9blocks                
                      ngf: 64                            
             no_antialias: False                         
          no_antialias_up: False                         
               no_dropout: True                          
                  no_flip: False                         
                  no_html: False                         
                    normD: instance                      
                    normG: instance                      
              num_patches: 256                           
              num_threads: 0                                    [default: 4]
                output_nc: 3
                    phase: train                         
                pool_size: 0                             
               preprocess: resize_and_crop               
          pretrained_name: None                          
               print_freq: 100                           
         random_scale_max: 3.0                           
             save_by_iter: False                         
          save_epoch_freq: 5                             
         save_latest_freq: 5000                          
           serial_batches: False                         
stylegan2_G_num_downsampling: 1                             
                   suffix:                               
         update_html_freq: 1000                          
                  verbose: True                                 [default: False]
----------------- End -------------------
dataset [UnalignedDataset] was created
model [CUTModel] was created
The number of training images = 214
Setting up a new session...
create web directory ./checkpoints/grumpifycat_FastCUT/web...
Traceback (most recent call last):
  File "train.py", line 43, in <module>
    model.data_dependent_initialize(data)
  File "/usr/src/CUT/models/cut_model.py", line 105, in data_dependent_initialize
    self.forward()                     # compute fake images: G(A)
  File "/usr/src/CUT/models/cut_model.py", line 154, in forward
    self.fake = self.netG(self.real)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/src/CUT/models/networks.py", line 1006, in forward
    fake = self.model(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

benz725 commented 3 years ago

I have the same error. However, my graphics card is an A100, for which CUDA 11.0 or later is recommended, so I cannot downgrade the CUDA version. How would the author solve this problem?

dashu233 commented 3 years ago

I solved this problem by replacing torch.randperm with np.random.permutation. It seems PyTorch 1.8 uses a different method to produce random permutations for len > 30000, which causes this bug.

JoshonSmith commented 3 years ago

> I solved this problem by replacing torch.randperm with np.random.permutation. It seems PyTorch 1.8 uses a different method to produce random permutations for len > 30000, which causes this bug.

This works! Environment: PyTorch 1.8, CUDA 11. Thanks!

xinwangxinwang commented 3 years ago

> I solved this problem by replacing torch.randperm with np.random.permutation. It seems PyTorch 1.8 uses a different method to produce random permutations for len > 30000, which causes this bug.

> This works! Environment: PyTorch 1.8, CUDA 11. Thanks!

Yes, it works. In models/networks.py (line 565), replace 'patch_id = torch.randperm(feat_reshape.shape[1], device=feats[0].device)' with 'patch_id = np.random.permutation(feat_reshape.shape[1])'.

Thank you!
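
For reference, the workaround can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the repository's code: sample_patch_ids and its seed parameter are hypothetical names, and it uses NumPy's seedable default_rng generator in place of the np.random.permutation call quoted above so the result is reproducible. The point is only that the permutation is drawn on the CPU rather than with torch.randperm on the GPU.

```python
import numpy as np


def sample_patch_ids(num_locations, num_patches, seed=None):
    """Pick up to num_patches distinct spatial indices on the CPU.

    Hypothetical helper: draws the random permutation with NumPy, which
    avoids calling torch.randperm on a CUDA device.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_locations)  # shuffled 0 .. num_locations - 1
    return perm[: min(num_patches, num_locations)]


# Example: a 256 x 256 feature map has 65536 locations, above the
# len > 30000 threshold mentioned in the thread.
ids = sample_patch_ids(256 * 256, 256, seed=0)
```

In a PyTorch model, the resulting array would then be moved to the feature tensor's device before indexing, for example with torch.as_tensor(ids, dtype=torch.long, device=feat.device), where feat stands for the feature tensor being sampled.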

taesungp commented 2 years ago

Thank you for the feedback and solution. I made the suggested change and pushed the code.

ErikValle commented 1 year ago

The issue has reappeared, although the previously mentioned patch has been applied. I used the environment.yml to set up a conda environment. Any suggestions?

taesungp / contrastive-unpaired-translation

RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #83