qianyu-dlut / MVANet

MIT License

Training error: ReLU variable has been modified by an in-place operation #2

Closed piercus closed 3 months ago

piercus commented 3 months ago

Thanks for this work; very interesting paper.

In-place operation error raised

I faced the following error while trying to run:

python train.py

Result:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [256, 1, 256]], which is output 0 of ReluBackward0

After doing some investigation, I feel the problem is coming from self.activation = get_activation_fn('relu') together with m.inplace = True

I have been able to find a workaround by using gelu instead of relu, but I'm still not sure about the purpose of this piece of code:

        for m in self.modules():
            if isinstance(m, nn.ReLU) or isinstance(m, nn.Dropout):
                m.inplace = True
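
As far as I can tell, that loop flips every nn.ReLU and nn.Dropout in the model to in-place mode. Here is a minimal sketch of what I think goes wrong (my own toy reproduction, not the actual MVANet code): F.relu saves its output for ReluBackward0, and an in-place nn.Dropout then overwrites that saved tensor before backward runs.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Toy reproduction (illustrative names, not from MVANet)
    lin = nn.Linear(8, 8)
    drop = nn.Dropout(p=0.1)
    drop.inplace = True              # roughly what the loop over self.modules() does

    x = torch.randn(4, 8)
    h = F.relu(lin(x))               # F.relu saves its output for ReluBackward0
    y = drop(h)                      # in-place dropout overwrites that saved output
    y.sum().backward()               # RuntimeError: ... output 0 of ReluBackward0

If that is what happens here, it would also explain why switching to gelu avoids the error: gelu's backward appears to use the saved input rather than the output, so mutating its output does not invalidate anything autograd needs.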

Full trace:

❯ python train.py 
Generator Learning Rate: 1e-05
/home/piercus/miniconda3/envs/mvanet/lib/python3.7/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/home/piercus/miniconda3/envs/mvanet/lib/python3.7/site-packages/PIL/Image.py:3179: DecompressionBombWarning: Image size (101824320 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
  DecompressionBombWarning,
/home/piercus/miniconda3/envs/mvanet/lib/python3.7/site-packages/PIL/Image.py:3179: DecompressionBombWarning: Image size (102717153 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
  DecompressionBombWarning,
/home/piercus/miniconda3/envs/mvanet/lib/python3.7/site-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 12 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
/home/piercus/miniconda3/envs/mvanet/lib/python3.7/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='mean' instead.
  warnings.warn(warning.format(ret))
Generator Learning Rate: 1e-05
/home/piercus/miniconda3/envs/mvanet/lib/python3.7/site-packages/torch/nn/functional.py:3734: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
/home/piercus/miniconda3/envs/mvanet/lib/python3.7/site-packages/PIL/Image.py:3179: DecompressionBombWarning: Image size (102521250 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
  DecompressionBombWarning,
/home/piercus/miniconda3/envs/mvanet/lib/python3.7/site-packages/torch/autograd/__init__.py:199: UserWarning: Error detected in ReluBackward0. Traceback of forward call that caused the error:
  File "train.py", line 103, in <module>
    sideout5, sideout4, sideout3, sideout2, sideout1, final, glb5, glb4, glb3, glb2, glb1, tokenattmap4, tokenattmap3,tokenattmap2,tokenattmap1= generator.forward(images)
  File "/home/piercus/repos/mvanet/model/MVANet.py", line 412, in forward
    e5 = self.multifieldcrossatt(loc_e5, glb_e5)  # (4,128,16,16)
  File "/home/piercus/miniconda3/envs/mvanet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/piercus/repos/mvanet/model/MVANet.py", line 141, in forward
    activated = self.activation(linear1)
  File "/home/piercus/miniconda3/envs/mvanet/lib/python3.7/site-packages/torch/nn/functional.py", line 1457, in relu
    result = torch.relu(input)
  File "/home/piercus/miniconda3/envs/mvanet/lib/python3.7/site-packages/torch/fx/traceback.py", line 57, in format_stack
    return traceback.format_stack()
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "train.py", line 126, in <module>
    scaler.scale(loss).backward()
  File "/home/piercus/miniconda3/envs/mvanet/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
    self, gradient, retain_graph, create_graph, inputs=inputs
  File "/home/piercus/miniconda3/envs/mvanet/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [256, 1, 256]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
qianyu-dlut commented 3 months ago

Thank you for taking the time to report this issue. To address the in-place operation problem, you can change the code in ./model/MVANet.py at line 139 from:

g_hw_b_c = g_hw_b_c + self.dropout2(self.linear2(self.dropout(self.activation(self.linear1(g_hw_b_c)))))

to:

g_hw_b_c = g_hw_b_c + self.dropout2(self.linear2(self.dropout(self.activation(self.linear1(g_hw_b_c)).clone())))

This modification should resolve the issue. We have committed the corrected version, which is now available for direct use.
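
To illustrate why the clone helps, here is a toy sketch under the same assumptions as the reproduction above (not the actual MVANet code): the in-place dropout now mutates a copy, so the tensor that relu saved for its backward pass stays at version 0.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Toy sketch: cloning the activation output before an in-place dropout
    # makes the mutation happen on a copy, so ReLU's saved output keeps
    # version 0 and the backward pass completes.
    lin = nn.Linear(8, 8)
    drop = nn.Dropout(p=0.1)
    drop.inplace = True

    x = torch.randn(4, 8)
    y = drop(F.relu(lin(x)).clone())
    y.sum().backward()               # no in-place error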

piercus commented 3 months ago

Great, thanks a lot :-)