open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Problem when changing the output channels of the RoI extractor #4596

Closed liuyanzhi1214 closed 3 years ago

liuyanzhi1214 commented 3 years ago

I have the same problem as #4310, but the answer was not useful. I only increased the output channels of the RoI features by using torch.cat to concatenate tensors. Error:

Traceback (most recent call last):
  File "/media/e706/disk_1/liuyanzhi/mmdetection-master/tools/train.py", line 181, in <module>
    main()
  File "/media/e706/disk_1/liuyanzhi/mmdetection-master/tools/train.py", line 177, in main
    meta=meta)
  File "/media/e706/disk_1/liuyanzhi/mmdetection-master/mmdet/apis/train.py", line 150, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/e706/anaconda3/envs/mmdetection/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/e706/anaconda3/envs/mmdetection/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/home/e706/anaconda3/envs/mmdetection/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/e706/anaconda3/envs/mmdetection/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 27, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/e706/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/e706/anaconda3/envs/mmdetection/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: grad_output must be contiguous

Config:

    roi_head=dict(
        type='StandardRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
            out_channels=1024,
            featmap_strides=[4, 8, 16, 32],
            is_context=False,
            is_fusion=True),
        bbox_head=dict(
            type='Shared4Conv1FCBBoxHead',
            in_channels=1024,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=2,
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0.0, 0.0, 0.0, 0.0],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            loss_cls=dict(
                type='FocalLoss',
                use_sigmoid=True,
                gamma=2,
                alpha=0.75,
                loss_weight=1.0),
            loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0),
            conv_out_channels=256,
            norm_cfg=dict(type='BN', requires_grad=True)))

Code:

for i in range(num_levels):
    mask = target_lvls == i
    inds = mask.nonzero(as_tuple=False).squeeze(1)
    # TODO: make it nicer when exporting to onnx
    if torch.onnx.is_in_onnx_export():
        # To keep all roi_align nodes exported to onnx
        rois_ = rois[inds]
        roi_feats_t = self.roi_layers[i](feats[i], rois_)
        roi_feats[inds] = roi_feats_t
        continue
    if inds.numel() > 0:
        rois_ = rois[inds]
        if self.is_fusion:
            # Pool the same RoIs from all four FPN levels and concatenate
            # along the channel dimension; this is what raises the RoI
            # feature channels to out_channels=1024.
            roi_feats_0 = self.roi_layers[0](feats[0], rois_)
            roi_feats_1 = self.roi_layers[1](feats[1], rois_)
            roi_feats_2 = self.roi_layers[2](feats[2], rois_)
            roi_feats_3 = self.roi_layers[3](feats[3], rois_)
            roi_feats_t = torch.cat(
                [roi_feats_0, roi_feats_1, roi_feats_2, roi_feats_3], dim=1)
            roi_feats[inds] = roi_feats_t
        else:
            roi_feats_t = self.roi_layers[i](feats[i], rois_)
            if self.is_context:
                # Build enlarged context boxes (2x the RoI width/height,
                # centered on the RoI); rois_ columns are
                # (batch_idx, x1, y1, x2, y2).
                context = torch.zeros(size=rois_.shape).cuda()
                for j in range(len(rois_)):
                    w = rois_[j][3] - rois_[j][1]
                    h = rois_[j][4] - rois_[j][2]
                    context[j][0] = rois_[j][0]
                    context[j][1] = rois_[j][1] - w / 2
                    context[j][2] = rois_[j][2] - h / 2
                    context[j][3] = rois_[j][3] + w / 2
                    context[j][4] = rois_[j][4] + h / 2
                factor = 0.5
                context_feats = self.roi_layers[i](feats[i], context)
                # context_feats_resize = torch.nn.MaxPool2d(2)(context_feats)
                roi_feats[inds] = factor * roi_feats_t + (1 - factor) * context_feats
            else:
                roi_feats[inds] = roi_feats_t
    else:
        # Keep every parameter and feature level in the graph when a level
        # has no RoIs, so distributed training does not hang.
        roi_feats += sum(
            x.view(-1)[0]
            for x in self.parameters()) * 0. + feats[i].sum() * 0.
return roi_feats

I look forward to a concrete and effective solution.

liuyanzhi1214 commented 3 years ago

It does not work. I tried roi_feats_t = torch.cat([roi_feats_0, roi_feats_1, roi_feats_2, roi_feats_3], dim=1).contiguous()

jshilong commented 3 years ago

I encountered a similar problem before. You can register a backward hook on the suspicious tensors to make grad_output contiguous.

liuyanzhi1214 commented 3 years ago

I encountered a similar problem before. You can register a backward hook on the suspicious tensors to make grad_output contiguous.

Thank you very much for your answer, but I don't know much about backward hooks. Can you explain in detail how to solve this problem? What puzzles me is that I only increased the RoI feature channels through torch.cat(); why does that cause a problem in the backward pass?

jshilong commented 3 years ago

I'm not sure it is caused by the cat operation; it may have been introduced by one of several other operations. One point needs to be clarified: a tensor that is contiguous in the forward pass does not necessarily receive a contiguous gradient in the backward pass. I can give you an example:

v = torch.tensor([0., 0., 0.], requires_grad=True)
h = v.register_hook(lambda grad: grad.contiguous())  # make the grad contiguous

You can perform this operation on all the suspicious tensors.
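
Applied to the is_fusion branch posted above, a minimal sketch of this workaround (assuming the per-level RoIAlign outputs feeding torch.cat are the suspicious tensors; this is an assumption, not a confirmed diagnosis) would register the hook on each pooled feature before concatenation, so the roi_align backward kernel always receives a contiguous grad_output:

if inds.numel() > 0:
    rois_ = rois[inds]
    if self.is_fusion:
        lvl_feats = []
        for lvl in range(len(self.roi_layers)):
            f = self.roi_layers[lvl](feats[lvl], rois_)
            # The hook fires during backward and replaces the incoming
            # gradient with a contiguous copy; forward results are unchanged.
            if f.requires_grad:
                f.register_hook(lambda grad: grad.contiguous())
            lvl_feats.append(f)
        roi_feats_t = torch.cat(lvl_feats, dim=1)
        roi_feats[inds] = roi_feats_t

If the error persists, the same hook can also be registered on roi_feats_t and on context_feats in the other branches.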

cizhenshi commented 3 years ago

I'm not sure it is caused by the cat operation; it may have been introduced by one of several other operations. One point needs to be clarified: a tensor that is contiguous in the forward pass does not necessarily receive a contiguous gradient in the backward pass. I can give you an example:

v = torch.tensor([0., 0., 0.], requires_grad=True)
h = v.register_hook(lambda grad: grad.contiguous())  # make the grad contiguous

You can perform this operation on all the suspicious tensors.

It worked for me. Thanks!