openseg-group / openseg.pytorch

The official PyTorch implementation of OCNet, OCRNet, and SegFix.
MIT License
1.19k stars, 140 forks

torch version for segfix training #64

Open cnnAndBn opened 3 years ago

cnnAndBn commented 3 years ago

Hi @PkuRainBow @hsfzxjy @LayneH, I am now using my own data to train a SegFix model for post-processing. Regarding the torch version: your config files use inplace_abn for all BN layers, and in the BN implementation it seems only torch versions 0.4,,1.2 are supported. Can I use torch 1.5 or higher? Another question: is "syncbn" OK for the SegFix model? Have you compared it with inplace_abn?

cnnAndBn commented 3 years ago

Hi, I have now installed torch 1.1, but another error is reported:

  File "openseg.pytorch-master/lib/extensions/inplace_abn_1/", line 208, in backward
    z, var, weight, bias = ctx.saved_tensors
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 320, 16, 16]], which is output 0 of InPlaceABNSyncBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

hsfzxjy commented 3 years ago

We have a branch in which some of the OCR code supports SyncBN and distributed training on PyTorch 1.7. You may try it out with SegFix.

cnnAndBn commented 3 years ago

How do I fix the torch.autograd.set_detect_anomaly(True) bug? I only modified the environment. @hsfzxjy

hsfzxjy commented 3 years ago

I don't know what's going on with your code. The error means an invalid in-place operation happened somewhere. To locate it, add torch.autograd.set_detect_anomaly(True) to the entry script, just before if __name__ == "__main__":. Then re-run the code and post the full traceback, so we can check how to solve it.
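
As a rough illustration of what this error means and what anomaly detection adds, here is a minimal standalone sketch (not code from this repo). The `run` helper and the toy tensors are made up for the example; only `torch.autograd.set_detect_anomaly` and the standard tensor operations are real PyTorch calls.

```python
import torch

# Enable anomaly detection before any forward pass runs, e.g. at module
# level just above the `if __name__ == "__main__":` guard.
torch.autograd.set_detect_anomaly(True)


def run(inplace: bool) -> str:
    """Hypothetical helper: run a tiny forward/backward and report the outcome."""
    x = torch.ones(3, requires_grad=True)
    y = x.exp()  # exp() saves its output to compute the gradient later
    if inplace:
        y.add_(1)  # modifies the saved tensor in place, bumping its version
    else:
        y = y + 1  # out-of-place: the saved tensor is left untouched
    try:
        y.sum().backward()
    except RuntimeError as err:
        # Keep only the first line of the autograd error message.
        return "error: " + str(err).splitlines()[0]
    return "ok"


if __name__ == "__main__":
    print(run(inplace=False))
    print(run(inplace=True))
```

With anomaly detection on, the in-place case should additionally print the forward-pass traceback of the operation whose saved tensor was modified, which is the part worth posting.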

cnnAndBn commented 3 years ago

@hsfzxjy

sys:1: RuntimeWarning: Traceback of forward call that caused the error:
  File "/root/.local/conda/envs/mmdet-2.8/lib/python3.7/", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/root/.local/conda/envs/mmdet-2.8/lib/python3.7/", line 926, in _bootstrap_inner
  File "/root/.local/conda/envs/mmdet-2.8/lib/python3.7/", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/parallel/", line 59, in _worker
    output = module(*input, **kwargs)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/myWorkBase/code/openseg.pytorch-master/lib/models/nets/", line 76, in forward
    x = self.backbone(x_)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/myWorkBase/code/openseg.pytorch-master/lib/models/backbones/hrnet/", line 735, in forward
    x_list.append(self.transition3i)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/", line 92, in forward
    input = module(input)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/", line 92, in forward
    input = module(input)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/myWorkBase/code/openseg.pytorch-master/lib/extensions/inplace_abn_1/", line 118, in forward
    , self.momentum, self.eps, self.activation, self.slope)

Traceback (most recent call last):
  File "../", line 230, in <module>
    model.train()
  File "/root/myWorkBase/code/openseg.pytorch-master/segmentor/", line 346, in train
    self.train()
  File "/root/myWorkBase/code/openseg.pytorch-master/segmentor/", line 189, in train
    backward_loss.backward()
  File "/root/.local/lib/python3.7/site-packages/torch/", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/.local/lib/python3.7/site-packages/torch/autograd/", line 93, in backward
    allow_unreachable=True) # allow_unreachable flag
  File "/root/.local/lib/python3.7/site-packages/torch/autograd/", line 77, in apply
    return self._forward_cls.backward(self, *args)
  File "/root/.local/lib/python3.7/site-packages/torch/autograd/", line 189, in wrapper
    outputs = fn(ctx, *args)
  File "/root/myWorkBase/code/openseg.pytorch-master/lib/extensions/inplace_abn_1/", line 208, in backward
    z, var, weight, bias = ctx.saved_tensors

cnnAndBn commented 3 years ago

@LayneH Another question: if I use the new branch, are the config options consistent with the old one? In other words, can I leave the JSON config and the input arguments unmodified?

hsfzxjy commented 3 years ago

@dadada101 You can try changing all nn.ReLU(inplace=True) to nn.ReLU(inplace=False) in . This should solve the problem.
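
If there are many occurrences, the replacement can be scripted. Below is a minimal sketch, assuming the calls are written literally as ReLU(inplace=True) in the source. `disable_inplace_relu` and `patch_file` are hypothetical helper names, not part of this repo, and the file to run it on is left to you.

```python
import re
from pathlib import Path
from typing import Tuple

# Matches `ReLU(inplace=True)` with optional spaces, e.g. `nn.ReLU(inplace = True)`.
_RELU_INPLACE = re.compile(r"(ReLU\(\s*inplace\s*=\s*)True(\s*\))")


def disable_inplace_relu(source: str) -> Tuple[str, int]:
    """Return the patched source text and the number of replacements made."""
    return _RELU_INPLACE.subn(r"\g<1>False\g<2>", source)


def patch_file(path: Path) -> int:
    """Rewrite one file on disk; returns how many ReLU calls were changed."""
    patched, count = disable_inplace_relu(path.read_text())
    if count:
        path.write_text(patched)
    return count


if __name__ == "__main__":
    demo = "self.relu = nn.ReLU(inplace=True)\n"
    print(disable_inplace_relu(demo))
```

This only touches literal `ReLU(inplace=True)` constructor calls. Other in-place operations (for example functional calls or `+=` on tensors) would still need to be checked by hand.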

Jerry365 commented 2 years ago

@hsfzxjy Excuse me, I have a question. When I run scripts/cityscapes/segfix/ train 1, all the errors are 0. I don't know where I went wrong. Thank you for your advice.