rcnn_cls: nan, rcnn_box nan while training

Lqqqying commented 3 years ago

The rcnn_cls and rcnn_box losses become NaN by the time training reaches iteration 100 of the first epoch. I am using the VOC2007 dataset without modification. My training command is: python trainval_net.py --dataset pascal_voc --net res101 --bs 24 --nw 8 --lr 0.001 --lr_decay_step 5 --cuda --mGPUs. All other parameters are left at their defaults.

[session 1][epoch 1][iter 0/ 417] loss: 54.1686, lr: 1.00e-03 fg/bg=(272/2800), time cost: 16.310677 rpn_cls: 1.9129, rpn_box: 0.5569, rcnn_cls: 51.1606, rcnn_box 0.5382
[session 1][epoch 1][iter 100/ 417] loss: nan, lr: 1.00e-03 fg/bg=(512/2560), time cost: 128.725597 rpn_cls: 0.6508, rpn_box: 0.1331, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 200/ 417] loss: nan, lr: 1.00e-03 fg/bg=(493/2579), time cost: 128.850651 rpn_cls: 0.5904, rpn_box: 0.0281, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 300/ 417] loss: nan, lr: 1.00e-03 fg/bg=(459/2613), time cost: 128.651249 rpn_cls: 0.5521, rpn_box: 0.0394, rcnn_cls: nan, rcnn_box nan

I have tried removing the -1 from the pascal_voc.py file to avoid integer underflow, but it did not help. Does anyone know what the problem is? Thanks a lot!

Fuhan1994 commented 3 years ago

I have a similar problem: rpn_cls and rpn_box are NaN. Have you solved it?

CheungBH commented 3 years ago

Maybe you can check whether there are boxes with "width=0" or "height=0" after preprocessing.
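A minimal sketch of such a check, assuming each box is an (xmin, ymin, xmax, ymax) tuple. The helper name find_degenerate_boxes and the sample boxes are hypothetical and not part of the repository; the idea is just to flag any box whose width or height is zero or negative before it reaches the loss computation.

```python
def find_degenerate_boxes(boxes):
    """Return indices of boxes whose width or height is not positive.

    Each box is assumed to be (xmin, ymin, xmax, ymax).
    This helper is illustrative and does not exist in the repository.
    """
    bad = []
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        if xmax - xmin <= 0 or ymax - ymin <= 0:
            bad.append(i)
    return bad


# Hypothetical sample data: one valid box, one zero-width box,
# and one box whose xmin is larger than its xmax.
boxes = [
    (10, 20, 110, 220),
    (50, 60, 50, 160),
    (65535, 5, 119, 80),
]
bad_indices = find_degenerate_boxes(boxes)
print(bad_indices)
```

Running something like this over the preprocessed annotations of every image would show which files, if any, need to be fixed or skipped.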

a-zhenzhen commented 3 years ago

I had the same problem, but I solved it by following this link: https://blog.csdn.net/forest_world/article/details/106034880

Jason-user commented 1 year ago

If a bbox label has a negative value, it is stored as (65536 + value). For example, if your xmin is -1, it becomes 65535 in this code, which means xmin ends up larger than xmax. This happens because "lib/datasets/pascal_voc.py" (line 229) uses dtype=np.uint16 to store the box coordinates, which forces the values to be non-negative. One fix is to replace np.uint16 with np.int32; if your bbox labels contain negative values, this should work.
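The wraparound can be reproduced without the training code. The sketch below uses the standard-library ctypes module as a stand-in for NumPy's fixed-width integer storage (an assumption made so the example has no dependencies); the helper names and the coordinate values are hypothetical.

```python
import ctypes


def store_as_uint16(value):
    """Mimic storing an integer in an unsigned 16-bit slot (wraps modulo 65536)."""
    return ctypes.c_uint16(value).value


def store_as_int32(value):
    """Mimic storing an integer in a signed 32-bit slot (keeps small negatives)."""
    return ctypes.c_int32(value).value


# Hypothetical label with a negative xmin, as described above.
xmin, xmax = -1, 119

u_xmin, u_xmax = store_as_uint16(xmin), store_as_uint16(xmax)
s_xmin, s_xmax = store_as_int32(xmin), store_as_int32(xmax)

print("unsigned 16-bit:", u_xmin, u_xmax)  # xmin wraps to 65535, so xmin > xmax
print("signed 32-bit:  ", s_xmin, s_xmax)  # xmin stays -1, so xmin < xmax
```

With a signed type the box keeps a positive width, although a negative coordinate still points outside the image and may be worth clipping to 0.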

jwyang / faster-rcnn.pytorch

rcnn_cls: nan, rcnn_box nan while training #864