openseg-group / openseg.pytorch

The official PyTorch implementation of OCNet, OCRNet, and SegFix.
MIT License

torch version for segfix training #64

Open cnnAndBn opened 3 years ago

cnnAndBn commented 3 years ago

Hi @PkuRainBow @hsfzxjy @LayneH, I am using my own data to train a SegFix model for post-processing. Regarding the torch version: your config files use inplace_abn for all BN layers, and in https://github.com/openseg-group/openseg.pytorch/blob/2c459f3b42deee26194f1802f353887d945e14c4/lib/models/tools/module_helper.py#L77 the BN implementation seems to support only torch versions 0.4, 1.0, 1.1, and 1.2. Can I use torch 1.5 or higher? Another question: is "syncbn" OK for the SegFix model? Have you compared it with inplace_abn?
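A version question like this one can be checked up front before training starts. The following is only a minimal sketch: the helper names and the `SUPPORTED_SERIES` set are hypothetical, taken from the versions listed in the question, and are not the repository's own check in module_helper.py.

```python
from typing import Tuple

# Assumption: the release series named in the question, not read from the repo.
SUPPORTED_SERIES = {(0, 4), (1, 0), (1, 1), (1, 2)}


def release_series(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.5.0+cu101'."""
    # Drop any local build suffix after '+', then keep the first two fields.
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def is_listed_series(version: str) -> bool:
    """Return True if the version belongs to one of the listed release series."""
    return release_series(version) in SUPPORTED_SERIES


# In a real environment the string would come from torch.__version__.
for candidate in ("1.1.0", "1.5.0+cu101"):
    print(candidate, is_listed_series(candidate))
```

Passing the version in as a plain string keeps the helper independent of which torch build is installed.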

cnnAndBn commented 3 years ago

Hi, I have now installed torch 1.1, but another error is reported:

  File "openseg.pytorch-master/lib/extensions/inplace_abn_1/functions.py", line 208, in backward
    z, var, weight, bias = ctx.saved_tensors
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 320, 16, 16]], which is output 0 of InPlaceABNSyncBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

hsfzxjy commented 3 years ago

We have a branch, https://github.com/openseg-group/openseg.pytorch/tree/pytorch-1.7 , in which some of the OCR models support SyncBN and distributed training on PyTorch 1.7. You may try it out with SegFix.

cnnAndBn commented 3 years ago

How do I fix the error that mentions torch.autograd.set_detect_anomaly(True)? I only modified the environment. @hsfzxjy

hsfzxjy commented 3 years ago

I don't know what is going on in your code. The error means that an invalid in-place operation happened somewhere. To locate it, add torch.autograd.set_detect_anomaly(True) in main.py, just before if __name__ == "__main__":, then re-run the code and post the full traceback. With that we can work out how to solve it.
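To make the meaning of this error concrete, here is a minimal sketch that triggers the same class of failure on purpose, with anomaly detection switched on. It is a toy reproduction on a three-element CPU tensor, not the SegFix code path, and the function name is hypothetical.

```python
import torch


def backward_after_inplace_edit() -> None:
    """Run a backward pass over a tensor that was edited in place."""
    x = torch.ones(3, requires_grad=True)
    y = x.exp()          # exp() saves its output y for the backward pass
    y.add_(1)            # the in-place edit bumps y's version counter
    y.sum().backward()   # autograd detects that the saved tensor changed


# With anomaly detection on, autograd also prints the forward-call traceback
# of the operation whose backward failed, which helps locate the in-place edit.
torch.autograd.set_detect_anomaly(True)
try:
    backward_after_inplace_edit()
except RuntimeError as err:
    message = str(err)
    print(message)
finally:
    torch.autograd.set_detect_anomaly(False)
```

Anomaly detection slows training down noticeably, so it is usually enabled only while hunting for the offending operation.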

cnnAndBn commented 3 years ago

@hsfzxjy

sys:1: RuntimeWarning: Traceback of forward call that caused the error:
  File "/root/.local/conda/envs/mmdet-2.8/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/root/.local/conda/envs/mmdet-2.8/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/root/.local/conda/envs/mmdet-2.8/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/myWorkBase/code/openseg.pytorch-master/lib/models/nets/segfix.py", line 76, in forward
    x = self.backbone(x_)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/myWorkBase/code/openseg.pytorch-master/lib/models/backbones/hrnet/hrnet_backbone.py", line 735, in forward
    x_list.append(self.transition3i)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/root/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/myWorkBase/code/openseg.pytorch-master/lib/extensions/inplace_abn_1/bn.py", line 118, in forward
    self.training, self.momentum, self.eps, self.activation, self.slope)

Traceback (most recent call last):
  File "../main.py", line 230, in <module>
    model.train()
  File "/root/myWorkBase/code/openseg.pytorch-master/segmentor/trainer.py", line 346, in train
    self.train()
  File "/root/myWorkBase/code/openseg.pytorch-master/segmentor/trainer.py", line 189, in train
    backward_loss.backward()
  File "/root/.local/lib/python3.7/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/root/.local/lib/python3.7/site-packages/torch/autograd/function.py", line 77, in apply
    return self._forward_cls.backward(self, *args)
  File "/root/.local/lib/python3.7/site-packages/torch/autograd/function.py", line 189, in wrapper
    outputs = fn(ctx, *args)
  File "/root/myWorkBase/code/openseg.pytorch-master/lib/extensions/inplace_abn_1/functions.py", line 208, in backward
    z, var, weight, bias = ctx.saved_tensors

cnnAndBn commented 3 years ago

@LayneH Another question: if I use the new branch https://github.com/openseg-group/openseg.pytorch/tree/pytorch-1.7 , are the config options consistent with the old ones? In other words, can I leave the JSON configs and the input arguments unchanged?

hsfzxjy commented 3 years ago

@dadada101 You can try changing every nn.ReLU(inplace=True) to nn.ReLU(inplace=False) in https://github.com/openseg-group/openseg.pytorch/blob/master/lib/models/backbones/hrnet/hrnet_backbone.py . This should solve the problem.
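The substitution suggested above can also be scripted instead of edited by hand. This is a minimal sketch using only the standard library; the function name `disable_inplace_relu` and the sample string are hypothetical, and the pattern only covers the literal `nn.ReLU(inplace=True)` spelling (with optional whitespace), not other ways of passing the flag.

```python
import re
from typing import Tuple

# Matches nn.ReLU(inplace=True), tolerating whitespace around the tokens.
_INPLACE_RELU = re.compile(r"nn\.ReLU\(\s*inplace\s*=\s*True\s*\)")


def disable_inplace_relu(source: str) -> Tuple[str, int]:
    """Return the rewritten source text and the number of replacements made."""
    return _INPLACE_RELU.subn("nn.ReLU(inplace=False)", source)


# Hypothetical snippet standing in for the contents of hrnet_backbone.py.
sample = (
    "self.relu = nn.ReLU(inplace=True)\n"
    "self.act = nn.ReLU(inplace = True)\n"
    "self.bn = nn.BatchNorm2d(64)\n"
)
patched, count = disable_inplace_relu(sample)
print(count)
print(patched)
```

To apply it to a real file, read the text, pass it through the function, and write the result back after reviewing the diff.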

Jerry365 commented 2 years ago

@hsfzxjy Excuse me, I have a question. When I run scripts/cityscapes/segfix/run_h_48_d_4_segfix.sh train 1, all the errors are 0. I don't know where I went wrong. Thank you for your advice. [image: error]