deokisys closed this issue 5 years ago
Currently, some Python layers from Faster R-CNN do not support multi-GPU training. You may want to check this repo to see how to do it. The integration process may be quite complicated. Unfortunately, I don't have access to a multi-GPU machine to test it. Good luck!
@nqanh What are your cuDNN and CUDA versions?
I want to build Caffe with USE_CUDNN := 1, but it's not working. I think it's a cuDNN version problem.
I'm using CUDA 8 and cuDNN 5. I can build with cuDNN without any problem. You may want to check on the Caffe site which cuDNN version works with your CUDA.
@nqanh Please forgive me for asking so many questions. I tried cuDNN 5 and 5.1, but they didn't work; cuDNN 4 did. However, one iteration takes almost 1 second, so the whole training would take almost 30 days. When you trained this for 2,000,000 iterations, how long did it take on your computer?
@deokisys No problem! Just tell us if you run into any issues with the code.
About the training time: it depends on your hardware and how big your dataset is. For example, the IIT-AFF dataset has around 6K training images, and training for 200,000 iterations (not 2,000,000 as you said) takes around 2 days on a Titan X.
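As a rough back-of-the-envelope check of these numbers (the seconds-per-iteration value below is an assumption inferred from the figures in this thread, not a measured value from the repo):

```python
def training_days(iterations, seconds_per_iter):
    # Total wall-clock time in days: iterations * sec/iter, divided by
    # the 86,400 seconds in a day.
    return iterations * seconds_per_iter / 86400.0

# 200,000 iterations at ~0.86 s/iter comes out to roughly 2 days,
# consistent with the Titan X figure above.
print(round(training_days(200000, 0.86), 2))   # ~1.99 days

# At the same speed, 2,000,000 iterations would be ~10x longer:
print(round(training_days(2000000, 0.86), 1))  # ~19.9 days
```

This is why it matters that the schedule is 200K iterations, not 2M: the difference is weeks of compute.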
@nqanh Thank you, I missed that. Now I have a loss problem:
I0125 21:00:01.081059  2920 solver.cpp:229] Iteration 160, loss = nan
I0125 21:00:01.081090  2920 solver.cpp:245]     Train net output #0: loss_bbox = nan (* 2 = nan loss)
I0125 21:00:01.081097  2920 solver.cpp:245]     Train net output #1: loss_cls = 87.3365 (* 3 = 262.01 loss)
I0125 21:00:01.081102  2920 solver.cpp:245]     Train net output #2: loss_mask = 25.5312 (* 3 = 76.5937 loss)
I0125 21:00:01.081107  2920 solver.cpp:245]     Train net output #3: rpn_cls_loss = 0.538947 (* 1 = 0.538947 loss)
I0125 21:00:01.081112  2920 solver.cpp:245]     Train net output #4: rpn_loss_bbox = 0.0324526 (* 1 = 0.0324526 loss)
I0125 21:00:01.081117  2920 sgd_solver.cpp:106] Iteration 160, lr = 0.001
If I build Caffe without cuDNN, training is slower (one iteration takes almost 2 s), but the loss goes down. If I build with cuDNN, training is faster (one iteration takes almost 0.8 s), but the loss is 'nan'. Is that bad? Does loss = nan mean the network is not training?
Loss = nan is a serious problem. Sometimes we have this problem due to numerical explosion in some Python layers (from Faster R-CNN). It's quite random and also depends on your GPU (and CUDA). You should stop the training and try to avoid this problem, because the network is dead once loss = nan. Also, we do not use cuDNN during training.
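One way to catch this early is a small guard in whatever Python loop drives the solver. This is a hypothetical helper (not part of this repo); it just formalizes the advice above to stop training as soon as any loss goes NaN:

```python
import math

def assert_finite_losses(losses):
    """Abort training as soon as any loss becomes NaN or inf.

    A NaN loss back-propagates NaN gradients and poisons the weights,
    so the network cannot recover once this happens.
    """
    for name, value in losses.items():
        if math.isnan(value) or math.isinf(value):
            raise RuntimeError(
                "loss '%s' became %r; restart training (e.g. lower the "
                "learning rate or rebuild Caffe without cuDNN)" % (name, value))

# Values copied from the log above -- the finite ones pass the check:
assert_finite_losses({"loss_cls": 87.3365, "rpn_cls_loss": 0.538947})
```

Calling it with `{"loss_bbox": float("nan")}` raises immediately instead of letting the dead network train for days.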
Hi, I tried training on 1 GPU for 2,000,000 iterations, but it would take more than 50 days, so I want to use multi-GPU training.
My computer has 4 GPUs (GTX 1080 Ti x 4).
I tried:
./experiments/scripts/faster_rcnn_end2end.sh 0,1,2,3 VGG16 pascal_voc
./experiments/scripts/faster_rcnn_end2end.sh 0:3 VGG16 pascal_voc
./experiments/scripts/faster_rcnn_end2end.sh {0,1,2,3} VGG16 pascal_voc
but none of them work. Can you tell me how to use multi-GPU training? I'd like to know the detailed steps. Thank you.
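For context on why all three forms fail: if I'm reading the py-faster-rcnn scripts correctly (an assumption, please verify against your copy of `tools/train_net.py`), the first script argument is forwarded as a single `--gpu` value parsed with `type=int`, so the whole training process is pinned to one device and any list-style argument is rejected before training starts:

```python
def parse_gpu_arg(arg):
    # Sketch of how py-faster-rcnn's argument parsing treats --gpu:
    # it is declared as a single integer device id, not a list.
    return int(arg)

print(parse_gpu_arg("0"))  # a single id is the only form the script accepts

try:
    parse_gpu_arg("0,1,2,3")
except ValueError:
    print("a GPU list fails to parse, so multi-GPU never even starts")
```

This matches the earlier answer in the thread: the Python layers themselves are single-GPU, so fixing the argument parsing alone would not be enough.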