Closed AzizCode92 closed 6 years ago
I'll take a look
Hi @vrenkens, are there any updates on this issue?
Yes, I was working on something else, but hopefully I can work on it this week
Hi, there was an error in the parameter server that caused it to crash. This could be the reason for your problem. I now have 2 GPUs running:
Tue Aug 14 10:07:15 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 Ti On | 00000000:01:00.0 Off | N/A |
| 59% 72C P2 98W / 250W | 5807MiB / 6077MiB | 49% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 980 Ti On | 00000000:02:00.0 Off | N/A |
| 49% 65C P2 109W / 250W | 5809MiB / 6078MiB | 48% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 18548 C python 5794MiB |
| 1 18549 C python 5597MiB |
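For context, TF1-style between-graph replication uses one parameter-server task plus one worker task per GPU, and a crashed "ps" task stalls every worker, which matches the symptom described above. A minimal sketch of such a cluster layout (the addresses and task counts here are illustrative assumptions, not Nabu's actual configuration):

```python
# Hypothetical cluster layout: 2 workers (one per GPU) and 1 parameter
# server. In TF1 this dict is passed to tf.train.ClusterSpec(...), and each
# process then starts its own tf.train.Server(cluster_spec, job_name=...,
# task_index=...). The workers push gradients to and pull variables from
# the "ps" task, so if that process dies, training hangs on all workers.
cluster = {
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
}

# The number of worker tasks should match the number of GPUs in use.
num_workers = len(cluster["worker"])
```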
The output looks like this:
WORKER 0: step 16/15600 loss: 3.249516, learning rate: 0.000998
time elapsed: 0.635164 sec
peak memory usage: 886/5883 MB
WORKER 1: step 17/15600 loss: 3.216467, learning rate: 0.000997
time elapsed: 13.911817 sec
peak memory usage: 886/5883 MB
WORKER 0: step 17/15600 loss: 3.250657, learning rate: 0.000997
time elapsed: 1.881364 sec
peak memory usage: 886/5883 MB
WORKER 1: step 19/15600 loss: 3.223689, learning rate: 0.000997
time elapsed: 2.540161 sec
peak memory usage: 886/5883 MB
Could you pull the newest version and try again?
Sorry it took so long!
Cheers, Vincens
Cool, I will try the new changes and keep you updated on whether it works for me. Thank you, Aziz
It works fine now. Thank you!
I was trying to run Single Machine Mode on a machine with 3 GPUs. I specified the GPU IDs and the number of workers in config/singlemachine.cfg, but when I started training, I saw that TensorFlow allocated memory on all GPUs while only one GPU was actually executing the process.
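By default, TensorFlow allocates memory on every GPU it can see, even when only one device is executing ops. One common way to restrict this (a general TF workaround, not necessarily how Nabu handles it) is to limit device visibility per process before TensorFlow is imported:

```python
import os

# Make only GPU 0 visible to this process. TensorFlow maps memory on every
# visible GPU, so this must be set BEFORE `import tensorflow`.
# "0" is an assumed device ID here; use the IDs from your singlemachine.cfg.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```

In a multi-worker single-machine setup, each worker process would set a different device ID so that every worker allocates memory only on its own GPU.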