laughtervv / DepthAwareCNN

Depth-aware CNN for RGB-D Segmentation, ECCV 2018
MIT License

an illegal memory access was encountered #20

Open garrett-cn opened 5 years ago

garrett-cn commented 5 years ago

CUDA_LAUNCH_BLOCKING=1 python train.py --name nyuv2_VGGdeeplab_depthconv --dataset_mode nyuv2 --flip --scale --crop --colorjitter --depthconv --list dataset/sunrgbd_training.lst

------------ Options -------------
batchSize: 1
beta1: 0.5
checkpoints_dir: ./checkpoints
colorjitter: True
continue_train: False
crop: True
dataroot:
dataset_mode: nyuv2
debug: False
decoder: psp_bilinear
depthconv: True
depthglobalpool: False
display_freq: 100
display_winsize: 512
encoder: resnet50_dilated8
fineSize: [480, 640]
flip: True
gpu_ids: [0]
inputmode: bgr-mean
isTrain: True
iterSize: 10
label_nc: 40
list: dataset/sunrgbd_training.lst
loadfroms: False
lr: 0.00025
lr_power: 0.9
max_dataset_size: inf
maxbatchsize: -1
model: DeeplabVGG
momentum: 0.9
nThreads: 1
name: nyuv2_VGGdeeplab_depthconv
nepochs: 100
no_html: False
phase: train
pretrained_model:
pretrained_model_HHA:
pretrained_model_rgb:
print_freq: 100
save_epoch_freq: 10
save_latest_freq: 1000
scale: True
serial_batches: False
tf_log: False
use_softmax: False
vallist:
verbose: False
warmup_iters: 500
wd: 0.0004
which_epoch: latest
which_epoch_HHA: latest
which_epoch_rgb: latest
-------------- End ----------------
CustomDatasetDataLoader
dataset [NYUDataset] was created

training images = 795

model [BaseModel] was created
create web directory ./checkpoints/nyuv2_VGGdeeplab_depthconv/web...
/home/cgl/miniconda3/envs/torch0.40/lib/python3.6/site-packages/torch/nn/functional.py:1749: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  "See the documentation of nn.Upsample for details.".format(mode))
/home/cgl/code/Git/DepthAwareCNN/models/Deeplab.py:106: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  self.averageloss += [self.loss.data[0]]
error in depthconv_col2im: an illegal memory access was encountered
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=26 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 55, in <module>
    model.backward(total_steps, opt.nepochs * dataset.__len__() * opt.batchSize + 1)
  File "/home/cgl/code/Git/DepthAwareCNN/models/Deeplab.py", line 112, in backward
    self.loss.backward()
  File "/home/cgl/miniconda3/envs/torch0.40/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/cgl/miniconda3/envs/torch0.40/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/cgl/code/Git/DepthAwareCNN/models/ops/depthconv/functions/depthconv.py", line 91, in backward
    gradweight = weight.new(*weight.size()).zero_()
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorMath.cu:26

gaoxiaoninghit commented 5 years ago

Have you solved the problem? I am running into the same one.

Elena-ssq commented 5 years ago

I'm facing the same error. Any clue on solving this?

hanchaoleng commented 5 years ago

I have the same problem.

phoebe0920 commented 4 years ago

I also ran into this problem:

self.depth = Variable(data['depth']).cuda()
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorCopy.c:21

I tried the following methods and finally managed to train. I hope they help you.

  1. Check the number of classes in your dataset. I use the SUNRGBD dataset, but I forgot to change the label_class.
  2. Try adding this line: torch.backends.cudnn.benchmark = True
  3. Print the data to check that everything looks right.
  4. I removed the data augmentation and training succeeded, so I traced the error to the augmentation. I then added the augmentations back one by one; the error came from the scale augmentation, so I decreased the scale.
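
The label check from the first suggestion can be sketched roughly as follows. This is a minimal illustration, not code from the repository: `find_invalid_labels` is a hypothetical helper, and `ignore_index=255` is an assumed convention that may not match this project. The idea is that a label value outside the range implied by `label_nc` can index past the end of a buffer on the GPU, which is one common way to end up with an illegal memory access.

```python
from collections import Counter


def find_invalid_labels(labels, num_classes, ignore_index=255):
    """Count label values outside [0, num_classes) that are not ignore_index.

    `labels` is any flat iterable of ints. The default ignore_index of 255
    is an assumption for illustration, not confirmed for this repository.
    """
    bad = Counter()
    for value in labels:
        value = int(value)
        if value == ignore_index:
            continue  # skip pixels that are meant to be ignored
        if value < 0 or value >= num_classes:
            bad[value] += 1
    return dict(bad)


if __name__ == "__main__":
    # Toy flattened label map: 40 and -1 are out of range for 40 classes.
    toy_labels = [0, 5, 39, 40, 255, -1, 40]
    print(find_invalid_labels(toy_labels, num_classes=40))
```

For a NumPy array or a PyTorch tensor, one option is to pass `label.flatten().tolist()` for each sample before training starts, using the same class count that is given to `label_nc`. An empty result means every label is in range.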

xieyuhaoli commented 4 years ago

I'm sorry, I didn't understand your solution.