jwyang / faster-rcnn.pytorch

A faster pytorch implementation of faster r-cnn
MIT License
7.69k stars 2.33k forks

rpn_box 0.0000 #541

Open luyaozhi opened 5 years ago

luyaozhi commented 5 years ago

When I train on VOC2007, rpn_box is always 0.0000:

[session 1][epoch 1][iter 0/5011] loss: 9.9775, lr: 1.00e-03 fg/bg=(7/249), time cost: 1.124771 rpn_cls: 0.6933, rpn_box: 0.0000, rcnn_cls: 3.0438, rcnn_box 0.0950, rcnn_cls_2nd: 3.0747, rcnn_box_2nd 0.0297, rcnn_cls_3rd: 3.0378, rcnn_box_3rd 0.0031
[session 1][epoch 1][iter 100/5011] loss: 4.2933, lr: 1.00e-03 fg/bg=(18/238), time cost: 88.347670 rpn_cls: 0.1459, rpn_box: 0.0000, rcnn_cls: 0.7563, rcnn_box 0.2220, rcnn_cls_2nd: 0.4431, rcnn_box_2nd 0.0312, rcnn_cls_3rd: 0.4980, rcnn_box_3rd 0.0130

Can you help me with it?

AlexanderHustinx commented 5 years ago

What training command are you running? (e.g. python trainval_net.py --cuda --lr 0.004 --nw 1 --dataset pascal_voc --net res101 --lr_decay_step 8 --bs 4)

Have you made any modifications to the code? Which branch are you on? pytorch-0.4.0 or pytorch-1.0?

xwdu commented 5 years ago

I have the same problem. I am on the pytorch-1.0 branch, and I ported the code from https://github.com/jwyang/fpn.pytorch, which is implemented for pytorch-0.4.0, into the faster rcnn code.

Besides porting the FPN code, I also modified model/roi_layers/roi_align.py to support FPN like this:

```python
from torch import nn


class ROIAlign(nn.Module):
    def __init__(self, output_size, spatial_scale, sampling_ratio):
        super(ROIAlign, self).__init__()
        self.output_size = output_size  # 7×7
        self.spatial_scale = spatial_scale  # 1.0/16.0
        self.sampling_ratio = sampling_ratio  # 0

    def forward(self, input, rois, scale):  # add parameter scale
        # roi_align is defined outside this excerpt.
        return roi_align(
            # input, rois, self.output_size, self.spatial_scale, self.sampling_ratio
            input, rois, self.output_size, scale, self.sampling_ratio
        )
```

After training begins, the rpn_box loss is always 0.000, while the other losses seem to work fine. Can anyone help me with this issue? What is the reason behind it, and is there any clue or method to trace or debug it?

xwdu commented 5 years ago

The author only provides an FPN implementation for pytorch-0.4.0, which seems to support only cuda-8.0. Since my GPU does not support cuda-8.0 and only works well with cuda-10, I had to port the FPN code to faster-rcnn.pytorch-1.0.0. I also modified the code of model/roi_layers/roi_align, referring to the code of faster-rcnn.pytorch-0.4.0 and fpn.pytorch-0.4.0.

@AlexanderHustinx @jwyang

AlexanderHustinx commented 5 years ago

As a sanity check, have you tried training the vanilla faster rcnn model from the pytorch-1.0.0 branch for a few epochs? If everything is set up correctly, you should not get a 0.000 loss in that case.

After that you can try debugging the modified code: e.g.

Also, how much of the code did you modify? I ask this because I believe this repository might use a slightly different RPN than the FPN repo.

If I recall correctly, the RPN box loss as well as other losses can become nan or 0.000 for several reasons, so pinpointing which one might take some debugging.
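One way to start narrowing this down is to scan the training log for loss terms that are exactly zero, nan, or inf. The following is only a rough sketch: `suspicious_losses` and `LOSS_PATTERN` are hypothetical names, not part of the repository, and the pattern assumes the log format shown in the first post of this thread.

```python
import math
import re

# Matches entries such as "rpn_box: 0.0000" and "rcnn_box 0.0950"
# (the log prints some loss names without a colon).
LOSS_PATTERN = re.compile(r"\b((?:rpn|rcnn)_\w+):?\s+(nan|inf|[0-9]*\.?[0-9]+)")


def suspicious_losses(log_line):
    """Return the loss terms in one log line that are exactly 0, nan, or inf."""
    flagged = {}
    for name, raw in LOSS_PATTERN.findall(log_line):
        value = float(raw)
        if math.isnan(value) or math.isinf(value) or value == 0.0:
            flagged[name] = value
    return flagged


line = "rpn_cls: 0.6933, rpn_box: 0.0000, rcnn_cls: 3.0438, rcnn_box 0.0950"
print(suspicious_losses(line))
```

A term that is flagged on every iteration, rather than occasionally, points at how that loss is computed or weighted and not at a single bad batch.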

JungmoKoo commented 5 years ago

@luyaozhi The fpn.pytorch project was written for pytorch 0.4.0, so I guess you ported the fpn code into your faster rcnn pytorch-1.0 setup. Am I right? If you use "anchor_target_layer_fpn.py", you need to modify some code.

Find the code below:

positive_weights = 1.0 / num_examples
negative_weights = 1.0 / num_examples

Modify it to:

positive_weights = 1.0 / num_examples.item()
negative_weights = 1.0 / num_examples.item()

If you use pytorch version 1.0 or higher, you should add ".item()" so that the tensor is converted to a plain Python number.
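A minimal sketch of what the suggested change does, assuming `num_examples` is a 0-dim integer tensor produced by counting labels. The thread does not show how `num_examples` is computed, so the `labels` tensor and its values below are made up for illustration.

```python
import torch

# Hypothetical anchor labels: 1 = foreground, 0 = background, -1 = ignored.
labels = torch.tensor([1, 0, -1, 0, 1, -1, 0, 0])

# Counting with torch.sum yields a 0-dim integer tensor, not a Python int.
num_examples = torch.sum(labels >= 0)

# .item() extracts a plain Python int, so the division below is ordinary
# Python float division regardless of the installed PyTorch version.
positive_weights = 1.0 / num_examples.item()
negative_weights = 1.0 / num_examples.item()

print(type(num_examples), type(positive_weights), positive_weights)
```

My reading of why this helps, which is not confirmed in the thread: on PyTorch releases without type promotion, dividing a Python float by an integer tensor may stay in integer arithmetic and give 0, which would zero out the weights applied to the RPN box loss.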