speedinghzl / pytorch-segmentation-toolbox

PyTorch Implementations for DeeplabV3 and PSPNet
MIT License
768 stars 167 forks

divide by zero error for running_var.mul_() #9

Closed amiltonwong closed 5 years ago

amiltonwong commented 5 years ago

Hi, @speedinghzl

I got RuntimeError: invalid argument 3: divide by zero from running_var.mul_((1 - ctx.momentum)).add_(ctx.momentum * var * n / (n - 1)) at functions.py, line 209, in forward:

2975 images are loaded!
Traceback (most recent call last):
  File "train.py", line 251, in <module>
    main()
  File "train.py", line 215, in main
    preds = model(images)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 112, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/code8/pytorch-segmentation-toolbox/networks/pspnet.py", line 148, in forward
    x = self.head(x)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/code8/pytorch-segmentation-toolbox/networks/pspnet.py", line 85, in forward
    priors = [F.upsample(input=stage(feats), size=(h, w), mode='bilinear', align_corners=True) for stage in self.stages] + [feats]
  File "/data/code8/pytorch-segmentation-toolbox/networks/pspnet.py", line 85, in <listcomp>
    priors = [F.upsample(input=stage(feats), size=(h, w), mode='bilinear', align_corners=True) for stage in self.stages] + [feats]
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/code8/pytorch-segmentation-toolbox/libs/bn.py", line 184, in forward
    self.activation, self.slope)
  File "/data/code8/pytorch-segmentation-toolbox/libs/functions.py", line 209, in forward
    running_var.mul_((1 - ctx.momentum)).add_(ctx.momentum * var * n / (n - 1))
RuntimeError: invalid argument 3: divide by zero at /pytorch/aten/src/THC/generic/THCTensorMathPairwise.cu:88

Any suggestion to fix it?

THX!

LeiyuanMa commented 5 years ago

When I set BS=1, I also ran into this problem, but BS=2 works fine...
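This matches the failing line: the running-variance update applies Bessel's correction, n / (n - 1), which divides by zero when a BN layer sees only n = 1 element per channel. In this PSPNet head that is plausible with BS=1, since the pyramid pooling module includes a 1x1 adaptive-pooling branch (so n = batch * H * W = 1 * 1 * 1). A minimal sketch of a guard, assuming a hypothetical helper name (the actual update lives in libs/functions.py):

```python
import torch

def update_running_stats(running_mean, running_var, mean, var, n, momentum):
    """Update BN running statistics in place, guarding the n == 1 case.

    Hypothetical helper; the repo does this inline in functions.py line 209.
    """
    running_mean.mul_(1 - momentum).add_(momentum * mean)
    if n > 1:
        # Bessel's correction: scale the biased batch variance by n / (n - 1)
        # to get an unbiased estimate, as in the original code.
        running_var.mul_(1 - momentum).add_(momentum * var * n / (n - 1))
    else:
        # With a single element per channel the unbiased estimate is
        # undefined; fall back to the biased variance instead of dividing
        # by zero.
        running_var.mul_(1 - momentum).add_(momentum * var)
    return running_mean, running_var
```

Using a batch size of at least 2 (or avoiding BN after global pooling) sidesteps the issue without patching the library.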

amiltonwong commented 5 years ago

@LeiyuanMa , thx for the suggestion.

After using BS=2, the training process starts, but after iter=3 it runs out of memory (although my GPU has 12 GB of memory).

2975 images are loaded!
iter = 0 of 400 completed, loss = 4.142366409301758
taking snapshot ...
iter = 1 of 400 completed, loss = 3.235548496246338
iter = 2 of 400 completed, loss = 2.8805861473083496
iter = 3 of 400 completed, loss = 1.8505399227142334
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 251, in <module>
    main()
  File "train.py", line 218, in main
    loss.backward()
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
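Since the OOM appears only after a few successful iterations, it may help to log peak GPU memory per step to see whether usage is stable but near the 12 GB limit (then a smaller crop size helps) or growing each iteration (which often indicates a loss tensor being accumulated with its graph attached). A diagnostic sketch, where model, images, labels, criterion, and optimizer stand in for the objects in train.py:

```python
import torch

def training_step(model, images, labels, criterion, optimizer):
    """Run one train step and report the peak GPU memory it used."""
    optimizer.zero_grad()
    preds = model(images)
    loss = criterion(preds, labels)
    loss.backward()
    optimizer.step()
    if torch.cuda.is_available():
        # Peak allocation since process start; a value that climbs every
        # iteration suggests tensors (e.g. `loss` kept in a running total
        # instead of `loss.item()`) are holding the autograd graph alive.
        peak_mb = torch.cuda.max_memory_allocated() / 1024 ** 2
        print(f'peak GPU memory: {peak_mb:.0f} MB')
    # Return a Python float so callers never retain the graph by accident.
    return loss.item()
```

If the peak is flat but too high, reducing the training crop size or the number of pyramid scales lowers it; if it grows, audit train.py for tensors stored across iterations.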
LeiyuanMa commented 5 years ago

I didn't run into this problem, but BS=2 can't reproduce satisfactory results; we should try a larger batch size.