tryolabs / luminoth

Deep Learning toolkit for Computer Vision.
https://tryolabs.com
BSD 3-Clause "New" or "Revised" License

thanks! how to train fasterrcnn with multiple GPUs on a server? #172

Open zzk88862 opened 6 years ago

npeirson commented 6 years ago

I could be wrong, but I believe you can specify distribution parameters. I don't know about local training, but for training on Google Cloud you'd just tack this onto your command line: --worker-count (the number of GPUs, or more if you want more than one worker per GPU).

zzk88862 commented 6 years ago

OK, thanks for your answer. What I mean is how to train with multiple GPUs on a single server, not distributed training.

npeirson commented 6 years ago

I've only just started exploring Luminoth, but since it's still alpha I'm going to guess you'll need to interact directly with TensorFlow to do that. That being said, I don't think it's terribly difficult; it's pretty much a matter of replacing your single CPU or GPU placement with a loop like for gpu in [gpu_1, gpu_2, gpu_3, ..., gpu_n]: or something similar. Check out this page for an example.
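For concreteness, here is a minimal, generic TensorFlow 1.x sketch of that "loop over GPUs" idea (the classic tower pattern). This is not Luminoth's own training loop; NUM_GPUS, build_loss and the random input batch are hypothetical placeholders you would replace with your real model and data.

```python
# Generic TF 1.x multi-GPU "tower" sketch: one model replica per GPU placed
# with tf.device, per-tower gradients averaged before a single update.
import tensorflow as tf

NUM_GPUS = 2  # assumed number of visible GPUs


def build_loss(batch):
    # Hypothetical stand-in for the real model: tiny linear layer + L2 loss.
    w = tf.get_variable("w", shape=[4, 1])
    pred = tf.matmul(batch, w)
    return tf.reduce_mean(tf.square(pred))


optimizer = tf.train.GradientDescentOptimizer(0.01)
tower_grads = []

with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
            batch = tf.random_normal([8, 4])  # placeholder input for this tower
            loss = build_loss(loss_batch := batch) if False else build_loss(batch)
            tf.get_variable_scope().reuse_variables()  # share weights across towers
            tower_grads.append(optimizer.compute_gradients(loss))

# Average gradients across towers and apply them once.
avg_grads = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars if g is not None]
    avg_grads.append((tf.reduce_mean(tf.stack(grads), axis=0), grads_and_vars[0][1]))
train_op = optimizer.apply_gradients(avg_grads)

config = tf.ConfigProto(allow_soft_placement=True)  # fall back when an op can't sit on the GPU
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)
```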

zzk88862 commented 6 years ago

Okay, thanks for your answer. I will try it.

zzk88862 commented 6 years ago

Thanks for your advice. I have tried multiple GPUs with with tf.device('/gpu:%d' % i), but the process always gets killed ("已杀死", i.e. "Killed"). I have debugged it many times, but it is not solved! Below are some of my run messages (see the note after them):

1. nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:81:00.0 Off |                    0 |
| N/A   61C    P0    24W /  75W |   4387MiB /  7606MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P4            Off  | 00000000:82:00.0 Off |                    0 |
| N/A   62C    P0    24W /  75W |   4881MiB /  7606MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24215      C   /root/anaconda3/envs/new52/bin/python       4369MiB |
|    1     24215      C   /root/anaconda3/envs/new52/bin/python       4863MiB |
+-----------------------------------------------------------------------------+

2. Run result: screenshot attachment (image not reproduced here)
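On Linux, "已杀死" / "Killed" with no Python traceback usually means the kernel's OOM killer stopped the process because host RAM (not GPU memory) ran out; `dmesg | grep -i kill` typically confirms it, and the usual fixes are a smaller batch size, fewer data-loading processes, or adding swap. As a debugging aid, the generic TensorFlow 1.x sketch below (not Luminoth-specific configuration) logs where each op is placed and makes GPU memory allocation incremental, which helps separate real GPU out-of-memory errors from host-RAM kills.

```python
# Generic TF 1.x session configuration for debugging device placement and
# GPU memory use; it does not by itself fix a host-RAM OOM kill.
import tensorflow as tf

config = tf.ConfigProto(
    allow_soft_placement=True,   # move ops TF cannot place on the requested GPU
    log_device_placement=True,   # print the device each op is actually assigned to
)
config.gpu_options.allow_growth = True  # grow GPU memory use instead of pre-allocating it all

with tf.Session(config=config) as sess:
    # Trivial op just to trigger the device-placement logging.
    print(sess.run(tf.reduce_sum(tf.ones([2, 2]))))
```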