luyaozhi opened 5 years ago
What training command are you running? (e.g. `python trainval_net.py --cuda --lr 0.004 --nw 1 --dataset pascal_voc --net res101 --lr_decay_step 8 --bs 4`)
Have you made any modifications to the code? Which branch are you on? pytorch-0.4.0 or pytorch-1.0?
I have the same problem. I am on the pytorch-1.0 branch, and I ported the code from https://github.com/jwyang/fpn.pytorch, which is implemented for pytorch-0.4.0, into the Faster R-CNN code.
Besides porting the FPN code, I also modified model/roi_layers/roi_align.py to support FPN like this:

```python
# Excerpt from model/roi_layers/roi_align.py; nn and roi_align come from that file.
class ROIAlign(nn.Module):
    def __init__(self, output_size, spatial_scale, sampling_ratio):
        super(ROIAlign, self).__init__()
        self.output_size = output_size        # 7×7
        self.spatial_scale = spatial_scale    # 1.0/16.0
        self.sampling_ratio = sampling_ratio  # 0

    def forward(self, input, rois, scale):  # add parameter scale
        return roi_align(
            # input, rois, self.output_size, self.spatial_scale, self.sampling_ratio
            input, rois, self.output_size, scale, self.sampling_ratio
        )
```

After training begins, the rpn_box loss is always 0.000, while the other losses seem to work fine. Can anyone help me with this issue? What is the reason behind this problem, and is there any clue or method to trace or debug it?
The author only provides an FPN implementation for pytorch-0.4.0, which seems to support only cuda-8.0. Since my GPU does not support cuda-8.0 and only works well with cuda-10, I have to port the FPN code to faster-rcnn.pytorch-1.0.0. I modified the code of model/roi_layers/roi_align with reference to the code of faster-rcnn.pytorch-0.4.0 and fpn.pytorch-0.4.0.
@AlexanderHustinx @jwyang
As a sanity check, have you tried training the vanilla Faster R-CNN model from the pytorch-1.0.0 branch for a few epochs? If everything is set up correctly, you should not get a 0.000 loss in that case.
After that you can try debugging the modified code: e.g.
Also, how much of the code did you modify? I ask this because I believe this repository might use a slightly different RPN than the FPN repo.
If I recall correctly, the RPN box loss as well as other losses can become nan or 0.000 for several reasons, so pinpointing which one might take some debugging.
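One cheap thing to check while debugging is whether the weights that multiply the box regression loss have collapsed to zero. The sketch below assumes rpn_box is a weighted smooth L1 loss, as in Faster R-CNN. It is a simplified pure-Python stand-in rather than the repository's own loss function, and the names `smooth_l1`, `inside` and the sample values are made up for illustration.

```python
def smooth_l1(pred, target, inside_w, outside_w, sigma=1.0):
    """Weighted smooth L1 over flat lists of floats (simplified, no batching)."""
    sigma2 = sigma ** 2
    total = 0.0
    for p, t, w_in, w_out in zip(pred, target, inside_w, outside_w):
        diff = abs(w_in * (p - t))
        if diff < 1.0 / sigma2:
            loss = 0.5 * sigma2 * diff ** 2   # quadratic region near zero
        else:
            loss = diff - 0.5 / sigma2        # linear region for large errors
        total += w_out * loss
    return total


# Illustrative values only: four regression outputs with non-trivial errors.
pred = [0.9, -1.5, 2.0, 0.1]
target = [0.0, 0.0, 0.0, 0.0]
inside = [1.0, 1.0, 1.0, 1.0]

# Healthy case: outside weights are 1 / num_examples (here 256 anchors).
ok_loss = smooth_l1(pred, target, inside, [1.0 / 256] * 4)

# Suspect case: the weights collapsed to zero, so the loss is exactly 0
# no matter how wrong the predictions are.
zero_loss = smooth_l1(pred, target, inside, [0.0] * 4)

print(f"ok_loss={ok_loss:.6f} zero_loss={zero_loss:.6f}")
```

If the real loss prints exactly 0.0000 from the first iteration, as opposed to a small but changing value, printing the sum of the inside and outside weights in the anchor target layer is a quick way to confirm or rule out this cause.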
@luyaozhi The fpn.pytorch project was written for PyTorch 0.4.0, so I assume you are porting the FPN code into your faster-rcnn.pytorch 1.0 code. Am I right? If you use "anchor_target_layer_fpn.py", you need to modify some code.
Find the following lines:
positive_weights = 1.0 / num_examples
negative_weights = 1.0 / num_examples
Change them to:
positive_weights = 1.0 / num_examples.item()
negative_weights = 1.0 / num_examples.item()
If you use PyTorch 1.0 or higher, you need to add ".item()" so that the tensor is converted to a plain Python number before the division.
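A minimal sketch of the same idea as a reusable helper, in case `num_examples` is sometimes a tensor and sometimes a plain number. The names `as_python_number`, `uniform_weights` and `FakeScalarTensor` are hypothetical and appear in neither repository. The stand-in class only mimics the `.item()` method of a one-element tensor so the example runs without PyTorch installed.

```python
def as_python_number(x):
    """Return a plain Python number; calls .item() on tensor-like objects."""
    return x.item() if hasattr(x, "item") else x


def uniform_weights(num_examples):
    """Compute 1 / num_examples for positive and negative anchors."""
    n = as_python_number(num_examples)
    if n <= 0:
        raise ValueError("num_examples must be positive")
    weight = 1.0 / n  # true division on a Python number, never truncated
    return weight, weight


class FakeScalarTensor:
    """Hypothetical stand-in for a one-element integer tensor."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


positive_weights, negative_weights = uniform_weights(FakeScalarTensor(256))
print(positive_weights, negative_weights)
```

With a real tensor the call would be the same, since `torch.Tensor.item()` returns the value of a one-element tensor as a standard Python number.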
When I train with voc2007, rpn_box is always 0.0000:

[session 1][epoch 1][iter 0/5011] loss: 9.9775, lr: 1.00e-03 fg/bg=(7/249), time cost: 1.124771 rpn_cls: 0.6933, rpn_box: 0.0000, rcnn_cls: 3.0438, rcnn_box 0.0950, rcnn_cls_2nd: 3.0747, rcnn_box_2nd 0.0297, rcnn_cls_3rd: 3.0378, rcnn_box_3rd 0.0031

[session 1][epoch 1][iter 100/5011] loss: 4.2933, lr: 1.00e-03 fg/bg=(18/238), time cost: 88.347670 rpn_cls: 0.1459, rpn_box: 0.0000, rcnn_cls: 0.7563, rcnn_box 0.2220, rcnn_cls_2nd: 0.4431, rcnn_box_2nd 0.0312, rcnn_cls_3rd: 0.4980, rcnn_box_3rd 0.0130

Can you help me with it?