nqanh / affordance-net

AffordanceNet - Multiclass Instance Segmentation Framework - ICRA 2018

Does multiple/parallel GPU training work? #17

Closed deokisys closed 5 years ago

deokisys commented 6 years ago

Hi, I tried training on 1 GPU for 2,000,000 iterations, but it would take more than 50 days, so I want to switch to multi-GPU training.

My computer has 4 GPUs (GTX 1080 Ti × 4):

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                 Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:05:00.0  On |                  N/A |
| 29%   47C    P8    18W / 250W |    606MiB / 11170MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:06:00.0 Off |                  N/A |
| 53%   84C    P2   218W / 250W |  10888MiB / 11172MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 48%   80C    P2   167W / 250W |  10888MiB / 11172MiB |     33%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 23%   31C    P8     8W / 250W |     11MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```

I tried:

```
./experiments/scripts/faster_rcnn_end2end.sh 0,1,2,3 VGG16 pascal_voc
./experiments/scripts/faster_rcnn_end2end.sh 0:3 VGG16 pascal_voc
./experiments/scripts/faster_rcnn_end2end.sh {0,1,2,3} VGG16 pascal_voc
```

but none of them work.

Can you tell me how to use multi-GPU training? I would like to know the detailed steps. Thank you.

nqanh commented 6 years ago

Currently, some Python layers from Faster R-CNN do not support multi-GPU training. You may want to check this repo to see how to do it. The integration process may be quite complicated. Unfortunately, I don't have access to a multi-GPU machine to test it. Good luck!
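For background, the training entry point inherited from py-faster-rcnn pins exactly one device through pycaffe before building the solver, roughly as in the sketch below (a simplified illustration only; the solver path and GPU id are placeholders, not the exact AffordanceNet code):

```python
# Simplified sketch of the single-GPU training setup in a py-faster-rcnn style
# tools/train_net.py (placeholder paths; not the exact AffordanceNet script).
import caffe

gpu_id = 0                      # a single device id -- the script accepts one GPU, not a list
caffe.set_mode_gpu()
caffe.set_device(gpu_id)        # the whole Python process is bound to this GPU

solver = caffe.SGDSolver('models/VGG16/solver.prototxt')  # placeholder solver path
solver.step(1)                  # every iteration, Python layers included, runs on that one GPU
```

Caffe's own multi-GPU support lives in the C++ `caffe train --gpu 0,1,2,3` tool, which, as far as I know, does not work with Python layers, so simply passing several GPU ids to faster_rcnn_end2end.sh has no effect.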

deokisys commented 6 years ago

@nqanh What are your cuDNN and CUDA versions? I want to build Caffe with USE_CUDNN := 1, but it is not working. I think it is a cuDNN version problem.

nqanh commented 6 years ago

I'm using CUDA 8 and cuDNN 5. I can build with cuDNN without any problem. You may want to check on the Caffe site which cuDNN version works with your CUDA version.
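If it helps to confirm which cuDNN the compiler is actually seeing, the version macros can be read straight from the installed header (a small sketch, assuming the header lives under the default /usr/local/cuda prefix; adjust the path otherwise):

```python
# Quick check of the installed cuDNN version before building Caffe
# (assumes the headers live under the default CUDA prefix; adjust if not).
import re

header = '/usr/local/cuda/include/cudnn.h'
with open(header) as f:
    macros = dict(re.findall(r'#define\s+(CUDNN_(?:MAJOR|MINOR|PATCHLEVEL))\s+(\d+)', f.read()))

print('cuDNN version: {CUDNN_MAJOR}.{CUDNN_MINOR}.{CUDNN_PATCHLEVEL}'.format(**macros))
```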

deokisys commented 6 years ago

@nqanh Please forgive me for asking so many questions. cuDNN 5 and 5.1 don't work for me, but cuDNN 4 does. Training now takes a little under 1 second per iteration, so the whole run would take almost 30 days. When you trained for 2,000,000 iterations, how long did it take on your machine?

nqanh commented 6 years ago

@deokisys No problem! Just tell us if you have any issues with the code.

About the training time: it depends on your hardware and how big your dataset is. For example, for the IIT-AFF dataset we have around 6K training images, and training for 200,000 iterations (not 2,000,000 as you said) takes around 2 days on a Titan X.
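As a rough sanity check on those numbers, the wall-clock time is just iterations times seconds per iteration (an illustrative back-of-the-envelope script; the per-iteration timings are the approximate figures quoted in this thread, not measurements):

```python
# Back-of-the-envelope training-time estimate: iterations * seconds per iteration.
def training_days(iterations, sec_per_iter):
    """Rough wall-clock estimate in days, ignoring snapshotting and validation overhead."""
    return iterations * sec_per_iter / (3600.0 * 24)

print(training_days(200000, 0.9))    # ~2.1 days -- consistent with "around 2 days" on a Titan X
print(training_days(2000000, 2.0))   # ~46 days  -- why 2,000,000 iterations looked like ~50 days
```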

deokisys commented 6 years ago

@nqanh Thank you, I missed that. I also have a problem with the loss:

```
I0125 21:00:01.081059  2920 solver.cpp:229] Iteration 160, loss = nan
I0125 21:00:01.081090  2920 solver.cpp:245]     Train net output #0: loss_bbox = nan (* 2 = nan loss)
I0125 21:00:01.081097  2920 solver.cpp:245]     Train net output #1: loss_cls = 87.3365 (* 3 = 262.01 loss)
I0125 21:00:01.081102  2920 solver.cpp:245]     Train net output #2: loss_mask = 25.5312 (* 3 = 76.5937 loss)
I0125 21:00:01.081107  2920 solver.cpp:245]     Train net output #3: rpn_cls_loss = 0.538947 (* 1 = 0.538947 loss)
I0125 21:00:01.081112  2920 solver.cpp:245]     Train net output #4: rpn_loss_bbox = 0.0324526 (* 1 = 0.0324526 loss)
I0125 21:00:01.081117  2920 sgd_solver.cpp:106] Iteration 160, lr = 0.001
```

If I build Caffe without cuDNN, training is slower (about 2 s per iteration), but the loss goes down. If I build Caffe with cuDNN, training is faster (about 0.8 s per iteration), but the loss is nan. Is that bad? Does loss = nan mean the network is not training?

nqanh commented 6 years ago

Loss = nan is a serious problem. We sometimes hit it because of numerical explosion in some of the Python layers (from Faster R-CNN). It's quite random and also depends on your GPU (and CUDA). You should stop the training and try to avoid this problem, because the network is dead once loss = nan. Also, we do not use cuDNN during training.
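If it helps, one way to catch the explosion right away instead of noticing it later in the log (a debugging sketch only, not part of the AffordanceNet training script; the solver path is a placeholder and the loss names are taken from the log above) is to step the solver from pycaffe and abort on the first non-finite loss:

```python
# Debugging sketch: step the solver manually and abort on the first non-finite loss,
# so a diverging run is caught immediately instead of after hours of wasted training.
import numpy as np
import caffe

caffe.set_mode_gpu()
caffe.set_device(0)

solver = caffe.SGDSolver('solver.prototxt')   # placeholder path
loss_blobs = ['loss_bbox', 'loss_cls', 'loss_mask', 'rpn_cls_loss', 'rpn_loss_bbox']

for it in range(200000):
    solver.step(1)
    for name in loss_blobs:
        value = float(solver.net.blobs[name].data)
        if not np.isfinite(value):
            raise RuntimeError('Iteration %d: %s = %r -- stop the run and investigate'
                               % (it, name, value))
```

When it does blow up, the usual things to try are a lower base_lr in the solver or, as noted above, building without cuDNN.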