vrenkens / nabu

Code for end-to-end ASR with neural networks, built with TensorFlow
MIT License

Single Machine Mode with Multi-GPU #39

Closed AzizCode92 closed 6 years ago

AzizCode92 commented 6 years ago

I was trying to run single machine mode on a machine with 3 GPUs. I specified the IDs of the GPUs and the number of workers in config/singlemachine.cfg. But when I started training, I saw that TensorFlow allocated memory on all GPUs while only one GPU was actually executing the process.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:19.0 Off |                    0 |
| N/A   62C    P0    64W / 149W |  10896MiB / 11439MiB |     21%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:1A.0 Off |                    0 |
| N/A   49C    P0    69W / 149W |  10873MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   55C    P0    57W / 149W |  10873MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
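
For context: TensorFlow 1.x reserves memory on every GPU it can see, even if it only ever runs ops on one of them, so memory showing up on all three cards does not by itself mean all three are computing. A minimal sketch, assuming the standard TF 1.x options (not nabu's actual code), of restricting one worker process to a single card:

import os
import tensorflow as tf

# Hide the other cards from this worker process; this must happen
# before TensorFlow initializes CUDA.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'  # e.g. worker 1 -> physical GPU 1

# Optionally also allocate memory on demand instead of grabbing it
# all up front.
config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))

with tf.Session(config=config) as sess:
    # Inside this process, '/gpu:0' now maps to physical GPU 1.
    print(sess.run(tf.reduce_sum(tf.random_normal([4, 4]))))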

vrenkens commented 6 years ago

I'll take a look

AzizCode92 commented 6 years ago

Hi @vrenkens, are there any updates on this issue?

vrenkens commented 6 years ago

Yes, I was working on something else, but hopefully I can work on it this week.

vrenkens commented 6 years ago

Hi, there was an error in the parameter server that caused it to crash. This could be the reason for your problem. I now have 2 GPUs running:

Tue Aug 14 10:07:15 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  On   | 00000000:01:00.0 Off |                  N/A |
| 59%   72C    P2    98W / 250W |   5807MiB /  6077MiB |     49%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 980 Ti  On   | 00000000:02:00.0 Off |                  N/A |
| 49%   65C    P2   109W / 250W |   5809MiB /  6078MiB |     48%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     18548      C   python                                      5794MiB |
|    1     18549      C   python                                      5597MiB |
+-----------------------------------------------------------------------------+

The output looks like this:

WORKER 0: step 16/15600 loss: 3.249516, learning rate: 0.000998 
     time elapsed: 0.635164 sec
     peak memory usage: 886/5883 MB
WORKER 1: step 17/15600 loss: 3.216467, learning rate: 0.000997 
     time elapsed: 13.911817 sec
     peak memory usage: 886/5883 MB
WORKER 0: step 17/15600 loss: 3.250657, learning rate: 0.000997 
     time elapsed: 1.881364 sec
     peak memory usage: 886/5883 MB
WORKER 1: step 19/15600 loss: 3.223689, learning rate: 0.000997 
     time elapsed: 2.540161 sec
     peak memory usage: 886/5883 MB
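
For background: in this mode each worker is a separate training process driving its own GPU, while the shared model variables live on a parameter-server process, which is why a crash of the parameter server stalls every worker. A minimal sketch of that between-graph pattern using the standard TF 1.x tf.train APIs (illustrative only, not nabu's actual code):

import tensorflow as tf

# One parameter-server process holds the shared variables;
# one worker process per GPU runs the training steps.
cluster = tf.train.ClusterSpec({
    'ps': ['localhost:2222'],
    'worker': ['localhost:2223', 'localhost:2224'],
})

def run(job_name, task_index):
    server = tf.train.Server(cluster, job_name=job_name,
                             task_index=task_index)
    if job_name == 'ps':
        # The ps only serves variables; if it crashes, every worker
        # blocks on its variable reads and training stalls.
        server.join()
        return

    # Variables are placed on the ps, ops on this worker's own GPU.
    with tf.device(tf.train.replica_device_setter(
            worker_device='/job:worker/task:%d/gpu:0' % task_index,
            cluster=cluster)):
        # Toy model standing in for the real training graph.
        x = tf.random_normal([32, 10])
        w = tf.get_variable('w', [10, 1])
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.AdamOptimizer(1e-3).minimize(
            loss, global_step=global_step)

    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(task_index == 0)) as sess:
        for _ in range(100):
            sess.run(train_op)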

Could you pull the newest version and try again?

Sorry it took so long!

Cheers, Vincens

AzizCode92 commented 6 years ago

Cool, I will try the new changes and keep you updated on whether they work for me. Thank you. Aziz

AzizCode92 commented 6 years ago

It works fine now, thank you!