Closed: NeuralBricolage closed this issue 2 years ago
Got exactly the same problem on PyTorch 1.8 and CUDA 11.1 (trying to run the default FastCUT training from the example). Downgrading to PyTorch 1.4 and CUDA 9.2 doesn't help and leads to:
root@d63a8e8c3efe:/usr/src/CUT# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148
root@d63a8e8c3efe:/usr/src/CUT# CUDA_VISIBLE_DEVICES=0 python3 train.py --gpu_ids 0 --dataroot ./datasets/grumpifycat --name grumpifycat_FastCUT --CUT_mode FastCUT --verbose --num_threads 0
----------------- Options ---------------
CUT_mode: FastCUT [default: CUT]
batch_size: 1
beta1: 0.5
beta2: 0.999
checkpoints_dir: ./checkpoints
continue_train: False
crop_size: 256
dataroot: ./datasets/grumpifycat [default: placeholder]
dataset_mode: unaligned
direction: AtoB
display_env: main
display_freq: 400
display_id: None
display_ncols: 4
display_port: 8097
display_server: http://localhost
display_winsize: 256
easy_label: experiment_name
epoch: latest
epoch_count: 1
evaluation_freq: 5000
flip_equivariance: True
gan_mode: lsgan
gpu_ids: 0
init_gain: 0.02
init_type: xavier
input_nc: 3
isTrain: True [default: None]
lambda_GAN: 1.0
lambda_NCE: 10.0
load_size: 286
lr: 0.0002
lr_decay_iters: 50
lr_policy: linear
max_dataset_size: inf
model: cut
n_epochs: 150
n_epochs_decay: 50
n_layers_D: 3
name: grumpifycat_FastCUT [default: experiment_name]
nce_T: 0.07
nce_idt: False
nce_includes_all_negatives_from_minibatch: False
nce_layers: 0,4,8,12,16
ndf: 64
netD: basic
netF: mlp_sample
netF_nc: 256
netG: resnet_9blocks
ngf: 64
no_antialias: False
no_antialias_up: False
no_dropout: True
no_flip: False
no_html: False
normD: instance
normG: instance
num_patches: 256
num_threads: 0 [default: 4]
output_nc: 3
phase: train
pool_size: 0
preprocess: resize_and_crop
pretrained_name: None
print_freq: 100
random_scale_max: 3.0
save_by_iter: False
save_epoch_freq: 5
save_latest_freq: 5000
serial_batches: False
stylegan2_G_num_downsampling: 1
suffix:
update_html_freq: 1000
verbose: True [default: False]
----------------- End -------------------
dataset [UnalignedDataset] was created
model [CUTModel] was created
The number of training images = 214
Setting up a new session...
create web directory ./checkpoints/grumpifycat_FastCUT/web...
Traceback (most recent call last):
  File "train.py", line 43, in <module>
    model.data_dependent_initialize(data)
  File "/usr/src/CUT/models/cut_model.py", line 105, in data_dependent_initialize
    self.forward()  # compute fake images: G(A)
  File "/usr/src/CUT/models/cut_model.py", line 154, in forward
    self.fake = self.netG(self.real)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/src/CUT/models/networks.py", line 1006, in forward
    fake = self.model(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I have the same error. However, my card is an A100, which is recommended to run with CUDA 11.0 or above, so I cannot downgrade the CUDA version. How will the author solve this problem?
I solved this problem by replacing torch.randperm with np.random.permutation. It seems PyTorch 1.8 uses a different method to produce random permutations for len > 30000, which causes this bug.
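For reference, a minimal sketch of the swap in context; the helper name and wrapper are illustrative (only the two permutation calls come from this thread), assuming a (B, H*W, C) feature map as in CUT's patch sampler:

import numpy as np
import torch

def sample_patch_ids(feat_reshape: torch.Tensor, num_patches: int) -> torch.Tensor:
    # feat_reshape is assumed to be (B, H*W, C); for early layers of a 256px
    # crop, H*W can exceed the len > 30000 threshold mentioned above.

    # Original line (triggers the cuDNN error on PyTorch 1.8 + CUDA 11):
    # patch_id = torch.randperm(feat_reshape.shape[1], device=feat_reshape.device)

    # Workaround: draw the permutation on the CPU with NumPy instead.
    patch_id = np.random.permutation(feat_reshape.shape[1])[:num_patches]
    return torch.as_tensor(patch_id, dtype=torch.long, device=feat_reshape.device)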
This works! Envs: PyTorch 1.8, CUDA 11. Thanks!!!
Yes, it works. In models/networks.py, line 565, replace
patch_id = torch.randperm(feat_reshape.shape[1], device=feats[0].device)
with
patch_id = np.random.permutation(feat_reshape.shape[1])
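One side effect of the workaround worth noting: the patch permutation is now drawn from NumPy's RNG rather than PyTorch's, so anyone relying on seeding for reproducible runs should seed both generators. A minimal sketch:

import numpy as np
import torch

seed = 0
torch.manual_seed(seed)  # still governs weight init, dropout, etc.
np.random.seed(seed)     # now also governs the patch permutation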
Thank you!
Thank you for the feedback and solution. I made the suggested change and pushed the code.
The issue has reappeared, although the previously mentioned patch has been applied. I used the environment.yml to set up a conda environment. Any suggestions?
Traceback (most recent call last):
  File "train.py", line 43, in <module>
    model.data_dependent_initialize(data)
  File "/home/helena/CUT/models/cut_model.py", line 108, in data_dependent_initialize
    self.compute_G_loss().backward()  # calculate graidents for G
  File "/home/helena/anaconda3/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/helena/anaconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Hello, I'm aware that this issue was already brought up and the suggestion was to downgrade to PyTorch 1.4, which I'm trying to avoid since I'm on CUDA 11. What I find interesting, though, is that CycleGAN training works just fine with the same setup (CUDA 11.1, PyTorch 1.8) and on the same dataset. Any suggestions on how to debug are welcome.
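Not a fix, but a generic debugging sketch (not specific to CUT): CUDA kernel launches are asynchronous, so the op named in an illegal-memory-access traceback is often not the one that faulted. Forcing synchronous launches, plus autograd anomaly detection for errors raised in backward, usually localizes the real culprit:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call

import torch
# Replays a backward-pass error together with the forward-pass stack trace
# of the op that produced the failing gradient node.
torch.autograd.set_detect_anomaly(True)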