Open zzk88862 opened 6 years ago
OK, thanks for your answer. What I meant is how to run multi-GPU training on a single server, not distributed training.
I've only just started exploring Luminoth, but since it's still alpha I'm going to guess you'll need to interact directly with TensorFlow to do that. That said, I don't think it's terribly difficult; it's pretty much a matter of replacing your single CPU or GPU call with something like `for gpu in [gpu_1, gpu_2, gpu_3, ..., gpu_n]:`. Check out this page for an example.
Okay, thanks for your answer. I will try it.
Thanks for your advice. I have tried multiple GPUs with `tf.device('/gpu:%d' % i)`, but the process is always "Killed" (已杀死). I have debugged many times, but it is not solved. The following are some of my run messages:
1. nvidia-smi output:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                 Driver Version: 384.111                  |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:81:00.0 Off |                    0 |
| N/A   61C    P0    24W /  75W |   4387MiB /  7606MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P4            Off  | 00000000:82:00.0 Off |                    0 |
| N/A   62C    P0    24W /  75W |   4881MiB /  7606MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24215      C   /root/anaconda3/envs/new52/bin/python       4369MiB |
|    1     24215      C   /root/anaconda3/envs/new52/bin/python       4863MiB |
+-----------------------------------------------------------------------------+
```
2. Run result:
I could be wrong, but I believe you can specify distribution parameters. I don't know about local training, but for training on Google Cloud you'd just tack this onto your command line: `--worker-count` (the number of GPUs, or greater if you wish to have more than one worker per GPU).