jwyang / faster-rcnn.pytorch

A faster pytorch implementation of faster r-cnn
MIT License
7.7k stars 2.33k forks

How to train with multiple GPUs #333

Open CXY573 opened 6 years ago

CXY573 commented 6 years ago

CUDA_VISIBLE_DEVICES=0,1 python trainval_net.py --dataset pascal_voc --mGPUs --net vgg16 --bs 16 --nw 12 --lr 0.001 --cuda

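I set up multi-GPU training like this, but it throws an error. How should I configure it? Is multi-GPU parallelism already implemented in the code?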

jwyang commented 6 years ago

@CXY573 what kind of error?

CXY573 commented 6 years ago

@jwyang Hi, I'm trying to use the code to train on my own data, but I'm getting an error with the rpn_box loss.

Sometimes rpn_box is not nan in the first log print:

[session 1][epoch 1][iter 0/ 300] loss: 4.0862, lr: 1.00e-02 fg/bg=(58/4038), time cost: 10.137277 rpn_cls: 0.7353, rpn_box: 1.1108, rcnn_cls: 2.2178, rcnn

but by the next log print it has always changed to nan:

[session 1][epoch 1][iter 100/ 300] loss: nan, lr: 1.00e-02 fg/bg=(4096/0), time cost: 162.149158 rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan

And sometimes nan already appears in the first log print.

I checked the values of rpn_bbox_inside_weights and rpn_bbox_outside_weights, and every element is 0.

So I don't know how to solve the problem.
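One way to narrow this down might be to stop at the first iteration where a loss term is no longer finite and inspect that batch. A minimal sketch, assuming the loss terms are available as plain Python floats; `first_non_finite` is a hypothetical helper, not part of this repository:

```python
import math

def first_non_finite(losses):
    """Return the name of the first loss term that is nan or inf, else None.

    `losses` is a hypothetical dict mapping term name -> float.
    """
    for name, value in losses.items():
        if not math.isfinite(value):
            return name
    return None

# Values taken from the first log line above, and a nan case like the second.
ok = {"rpn_cls": 0.7353, "rpn_box": 1.1108, "rcnn_cls": 2.2178}
bad = {"rpn_cls": float("nan"), "rpn_box": float("nan")}
print(first_non_finite(ok))
print(first_non_finite(bad))
```

Checking every iteration rather than every 100 would show which batch first produces nan.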

jwyang commented 6 years ago

@CXY573 It is not because of the multi-GPU training. It might be because there are some corner cases in your training data. Please refer to some previous posts about this issue.
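A minimal sketch of what checking annotations for such corner cases might look like, assuming each box is an `(x1, y1, x2, y2)` tuple in pixels and the image width and height are known. `find_bad_boxes` and the specific checks are illustrative assumptions, not code from this repository:

```python
def find_bad_boxes(boxes, width, height):
    """Return (index, reason) pairs for boxes that look degenerate.

    boxes: list of (x1, y1, x2, y2) tuples in pixel coordinates.
    """
    bad = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if x1 < 0 or y1 < 0:
            bad.append((i, "negative coordinate"))
        elif x2 > width or y2 > height:
            bad.append((i, "outside image"))
        elif x2 <= x1 or y2 <= y1:
            bad.append((i, "zero or negative size"))
    return bad

# Hypothetical annotations for a 640x480 image.
boxes = [(10, 20, 50, 80), (-1, 5, 30, 40), (15, 15, 15, 60), (0, 0, 700, 100)]
print(find_bad_boxes(boxes, width=640, height=480))
```

Running a check like this over the whole dataset before training would list the images whose annotations need fixing or removing.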

TonyTangYu commented 5 years ago

@CXY573 I ran the same command as you did and ran into some errors too. The detailed description is here. I don't know how to fix it. Could you please help me? Thank you!