Open Lqqqying opened 3 years ago
I have a similar problem: rpn_cls and rpn_box are NaN. Have you solved it?
Maybe you can check whether any boxes have "width=0" or "height=0" after preprocessing.
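For anyone who wants to run that check, here is a minimal sketch. It assumes the boxes are available as rows of [xmin, ymin, xmax, ymax]; the helper name find_degenerate_boxes is hypothetical and not part of the repository. It treats a box as degenerate when xmax <= xmin or ymax <= ymin, which may need adjusting if your coordinates are inclusive.

```python
import numpy as np

def find_degenerate_boxes(boxes):
    """Return indices of boxes whose width or height is zero or negative.

    `boxes` is assumed to hold rows of [xmin, ymin, xmax, ymax].
    This is a hypothetical helper, not code from the repository.
    """
    # Use a signed dtype so the subtraction below cannot wrap around.
    boxes = np.asarray(boxes, dtype=np.int64).reshape(-1, 4)
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    return np.flatnonzero((widths <= 0) | (heights <= 0))

# Made-up example data: the second box has zero width, the third zero height.
sample = [[10, 10, 50, 50], [30, 20, 30, 60], [5, 40, 25, 40]]
bad = find_degenerate_boxes(sample)
print("degenerate box indices:", bad.tolist())
```

Running something like this over every image's annotations after preprocessing (including flipping) should point to the images that need fixing or filtering.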
I had the same problem, but I solved it with this link: https://blog.csdn.net/forest_world/article/details/106034880
If your bbox labels contain negative values, they will be stored as (65536 + value). For example, if your xmin is -1, it becomes 65535 in this code, which means xmin will be bigger than xmax. This happens because line 229 of "lib/datasets/pascal_voc.py" uses dtype=np.uint16 to store the box values, which forces them to be non-negative. One thing we can do is replace np.uint16 with np.int32; if your bbox labels contain negative values, that should work.
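The wraparound can be reproduced in isolation. This is only a sketch with made-up coordinates: the helper store_boxes is hypothetical and stands in for the way the dataset code stores boxes in a typed array, it is not the repository's code.

```python
import numpy as np

def store_boxes(raw_boxes, dtype):
    """Store [xmin, ymin, xmax, ymax] rows in an array of the given dtype.

    Hypothetical helper for illustration; astype() wraps out-of-range
    integers the way an unsigned 16-bit cast does.
    """
    return np.asarray(raw_boxes, dtype=np.int64).astype(dtype)

raw = [[-1, 20, 50, 80]]  # made-up label with a negative xmin

as_uint16 = store_boxes(raw, np.uint16)
as_int32 = store_boxes(raw, np.int32)

print("uint16:", as_uint16.tolist())  # xmin wraps to 65536 + (-1)
print("int32: ", as_int32.tolist())   # xmin keeps its sign

# Compute the widths in a signed dtype so the subtraction itself cannot wrap.
width_uint16 = as_uint16.astype(np.int64)[:, 2] - as_uint16.astype(np.int64)[:, 0]
width_int32 = as_int32[:, 2] - as_int32[:, 0]
print("width from uint16 storage:", width_uint16.tolist())
print("width from int32 storage: ", width_int32.tolist())
```

With uint16 storage the box ends up with xmin > xmax, which matches the description above. With int32 storage the coordinates stay as labeled, although negative coordinates may still need clipping to the image bounds.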
The rcnn_cls and rcnn_box losses become NaN when training reaches iteration 100 of the first epoch. I use the VOC2007 dataset without modification. My training command is: python trainval_net.py --dataset pascal_voc --net res101 --bs 24 --nw 8 --lr 0.001 --lr_decay_step 5 --cuda --mGPUs. Other parameters are default.
[session 1][epoch 1][iter 0/ 417] loss: 54.1686, lr: 1.00e-03
fg/bg=(272/2800), time cost: 16.310677
rpn_cls: 1.9129, rpn_box: 0.5569, rcnn_cls: 51.1606, rcnn_box 0.5382
[session 1][epoch 1][iter 100/ 417] loss: nan, lr: 1.00e-03
fg/bg=(512/2560), time cost: 128.725597
rpn_cls: 0.6508, rpn_box: 0.1331, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 200/ 417] loss: nan, lr: 1.00e-03
fg/bg=(493/2579), time cost: 128.850651
rpn_cls: 0.5904, rpn_box: 0.0281, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 300/ 417] loss: nan, lr: 1.00e-03
fg/bg=(459/2613), time cost: 128.651249
rpn_cls: 0.5521, rpn_box: 0.0394, rcnn_cls: nan, rcnn_box nan
I have tried removing the -1 from the pascal_voc.py file to avoid integer underflow, but it did not help. Does anyone know what the problem is? Thanks a lot!