RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #83

Closed NeuralBricolage closed 2 years ago

NeuralBricolage commented 3 years ago

Traceback (most recent call last):
  File "train.py", line 43, in <module>
    model.data_dependent_initialize(data)
  File "/home/helena/CUT/models/cut_model.py", line 108, in data_dependent_initialize
    self.compute_G_loss().backward() # calculate graidents for G
  File "/home/helena/anaconda3/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/helena/anaconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Hello, I'm aware that this issue was already brought up and that the suggestion was to downgrade to PyTorch 1.4, which I'm trying to avoid since I'm on CUDA 11. What I find interesting, though, is that CycleGAN training works just fine with the same setup (CUDA 11.1, PyTorch 1.8) and on the same dataset. Any suggestions on how to debug this are welcome.
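
For anyone trying to narrow this down: CUDA kernel errors are reported asynchronously, so the Python frame in the traceback may not be the call that actually faulted. A minimal sketch of one way to check, assuming the `torch.randperm` call discussed in this thread is the suspect: set `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialised and exercise the call in isolation. The helper name `check_randperm` and the size `65536` are hypothetical, chosen only for illustration.

```python
import os

# Force synchronous kernel launches so the traceback points at the failing call.
# This has to be set before CUDA is initialised; doing it before importing torch is the safe choice.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def check_randperm(n=65536):
    """Run torch.randperm on the GPU and verify the result is a valid permutation.

    Returns None when torch or a CUDA device is not available.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    perm = torch.randperm(n, device="cuda")
    torch.cuda.synchronize()  # surface any pending CUDA error here
    expected = torch.arange(n, device="cuda")
    return torch.equal(perm.sort().values, expected)


result = check_randperm()
print("randperm check:", result)
```

If this small script already raises the illegal memory access, the problem is in the PyTorch/CUDA combination rather than in the training code. The same environment variable can also be set on the command line when launching `train.py`.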

layer19 commented 3 years ago

Got exactly the same problem on PyTorch 1.8 and CUDA 11.1 (trying to run the default FastCUT training from the example). Downgrading to PyTorch 1.4 and CUDA 9.2 doesn't help and leads to:

root@d63a8e8c3efe:/usr/src/CUT# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148

root@d63a8e8c3efe:/usr/src/CUT# CUDA_VISIBLE_DEVICES=0 python3 train.py  --gpu_ids 0 --dataroot ./datasets/grumpifycat --name grumpifycat_FastCUT --CUT_mode FastCUT --verbose --num_threads 0
dataset [UnalignedDataset] was created
model [CUTModel] was created
The number of training images = 214
Setting up a new session...
create web directory ./checkpoints/grumpifycat_FastCUT/web...
Traceback (most recent call last):
  File "train.py", line 43, in <module>
  File "/usr/src/CUT/models/cut_model.py", line 105, in data_dependent_initialize
    self.forward()                     # compute fake images: G(A)
  File "/usr/src/CUT/models/cut_model.py", line 154, in forward
    self.fake = self.netG(self.real)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/src/CUT/models/networks.py", line 1006, in forward
    fake = self.model(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)

benz725 commented 3 years ago

I have the same error. However, my graphics card is an A100, for which CUDA 11.0 or above is recommended, so I cannot downgrade the CUDA version. How will the author solve this problem?

dashu233 commented 3 years ago

I solved this problem by replacing torch.randperm with np.random.permutation. It seems PyTorch 1.8 uses a different method to produce random permutations for len>30000, which causes this bug.

JoshonSmith commented 3 years ago

This works! Environment: PyTorch 1.8, CUDA 11. Thanks!

xinwangxinwang commented 3 years ago

Yes, it works. In models/networks.py (line 565), replace 'patch_id = torch.randperm(feat_reshape.shape[1], device=feats[0].device)' with 'patch_id = np.random.permutation(feat_reshape.shape[1])'.
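
As a rough illustration of that replacement (a sketch, not the repository's actual code), the permutation can be drawn on the CPU with NumPy and only converted to a tensor afterwards. The function names `sample_patch_ids` and `to_index_tensor` and the sizes used below are hypothetical.

```python
import numpy as np


def sample_patch_ids(num_locations, num_patches):
    """Pick up to num_patches distinct spatial indices using a CPU-side shuffle."""
    perm = np.random.permutation(num_locations)  # replaces torch.randperm(..., device=...)
    return perm[:min(num_patches, num_locations)]


def to_index_tensor(ids, device):
    """Convert the NumPy indices to a long tensor on the feature map's device."""
    import torch  # imported lazily so the sampling part runs without torch installed
    return torch.as_tensor(ids, dtype=torch.long, device=device)


# Example: 256 patch locations out of a 256x256 feature map (sizes are illustrative).
ids = sample_patch_ids(256 * 256, 256)
print(len(ids), int(ids.min()) >= 0, int(ids.max()) < 256 * 256)
```

The resulting indices can then be used to index the reshaped feature tensor, for example via `to_index_tensor(ids, feat_reshape.device)`.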

Thank you!

taesungp commented 2 years ago

Thank you for the feedback and solution. I made the suggested change and pushed the code.

ErikValle commented 1 year ago

The issue has reappeared, although the previously mentioned patch has been applied. I used the environment.yml to set up a conda environment. Any suggestions?