Open CXY573 opened 6 years ago
@CXY573 what kind of error?
@jwyang Hi, I'm trying to use the code to train on my own data, but I get an error with the rpn_box_loss:
Sometimes rpn_box is not nan in the first print_log:
[session 1][epoch 1][iter 0/ 300] loss: 4.0862, lr: 1.00e-02 fg/bg=(58/4038), time cost: 10.137277 rpn_cls: 0.7353, rpn_box: 1.1108, rcnn_cls: 2.2178, rcnn
but by the next print_log it has always changed to nan:
[session 1][epoch 1][iter 100/ 300] loss: nan, lr: 1.00e-02 fg/bg=(4096/0), time cost: 162.149158 rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
And sometimes nan already appears in the first print_log.
I checked the output of rpn_bbox_inside_weights and rpn_bbox_outside_weights; every element is 0.
So I don't know how to solve the problem.
@CXY573 it is not because of the multi-GPU training. It might be because there are some corner cases in your training data. Please refer to some previous posts about this issue.
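One way to look for such corner cases is to scan the ground-truth boxes before training. The snippet below is only a minimal sketch and is not part of this repository: `find_bad_boxes` is a hypothetical helper, and it assumes boxes are given as (xmin, ymin, xmax, ymax) in pixel coordinates together with the image width and height. It flags boxes that fall outside the image or have zero or negative size, which are typical candidates for annotations that break box regression.

```python
def find_bad_boxes(boxes, width, height):
    """Return (index, reason) pairs for suspicious ground-truth boxes.

    Hypothetical helper: boxes is an iterable of (xmin, ymin, xmax, ymax)
    tuples in pixel coordinates; width and height are the image size.
    """
    problems = []
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        if xmin < 0 or ymin < 0 or xmax > width or ymax > height:
            # Box extends past the image border.
            problems.append((i, "outside image"))
        elif xmax <= xmin or ymax <= ymin:
            # Box has no area, or its corners are swapped.
            problems.append((i, "zero or negative size"))
    return problems


# Made-up example annotations for a 500x375 image.
boxes = [(10, 20, 110, 220), (50, 50, 50, 90), (-3, 0, 40, 40), (0, 0, 30, 700)]
for index, reason in find_bad_boxes(boxes, width=500, height=375):
    print(index, reason)
```

Running a check like this over every annotation file and fixing or dropping the flagged boxes is a cheap first step before looking at the model itself.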
@CXY573 I ran the same command as you did and got some errors too. The detailed description is here. I don't know how to fix it. Could you please help me? Thank you!
CUDA_VISIBLE_DEVICES=0,1 python trainval_net.py --dataset pascal_voc --mGPUs --net vgg16 --bs 16 --nw 12 --lr 0.001 --cuda
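I set up multi-GPU training like this, but it raised an error. How should it be configured? Is multi-GPU parallelism already implemented in the program?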