tensorflow / mesh

Mesh TensorFlow: Model Parallelism Made Easier
Apache License 2.0

Performance on GPUs and multiple GPU support #80

Open nict-wisdom opened 4 years ago

nict-wisdom commented 4 years ago

We tried to run Mesh-TensorFlow to train T5 on GPUs following the instructions on T5's repository, but the training is extremely slow.

global_step/sec: 0.0467347 examples/sec: 0.186939

The training script successfully detected the GPUs (showing "Adding visible gpu devices: ..."), but most of the computation seems to run on the CPU. By enabling log_device_placement, we can see many operators placed on both CPUs and GPUs. ProfilerHook showed that both are actually used, but I could not tell whether this behavior is expected.
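For reference, a minimal sketch of how device-placement logging can be enabled through the Estimator's session config (assuming the TF 1.x-style Estimator API that the T5 training scripts use; guarded so it degrades gracefully where TensorFlow or tf.estimator is unavailable):

```python
# Sketch: enable device-placement logging for an Estimator-based run.
try:
    import tensorflow as tf

    session_config = tf.compat.v1.ConfigProto(log_device_placement=True)
    run_config = tf.estimator.RunConfig(session_config=session_config)
    have_tf = True
except (ImportError, AttributeError):
    have_tf = False  # TensorFlow / tf.estimator not installed here
```

With log_device_placement=True, the session prints one line per op showing which device it was assigned to, which is how the CPU/GPU split above was observed.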

I am wondering whether Mesh-TensorFlow runs on GPUs in a practical sense. I found an issue that mentioned a similar problem, but it was closed without an answer (#35).

I also could not find reliable documentation about training on multiple GPUs. An existing issue, #20, raised the same question, but no answer was given.

I would appreciate it if someone could give us any information regarding the questions above.

mcompute commented 4 years ago

Facing the same issue.

LiweiPeng commented 4 years ago

Facing the same issue. Can someone share some answers for this?

knagrecha commented 4 years ago

Also seeing this issue. Monitoring GPU usage shows that only one GPU is being utilized when running BERT.

xdgarrido commented 4 years ago

The current MNIST example uses only a single GPU on AMD/ROCm platforms.

PSZehnder commented 4 years ago

I can run the MNIST example on a GPU, and it does not appear to be using CPU resources. However, when using 4 GPUs, only the first device is actually utilized.

Hopefully we can get a developer response on this... ~I can't see what would need to be modified in mnist.py to make distributed GPU training work.~

EDIT: specifying your devices by name, ['gpu:0', 'gpu:1', 'gpu:2'], instead of [''] * mesh_size solves the problem for me
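A minimal sketch of that fix: build the device list with explicit GPU names rather than empty strings, then hand it to the placement mesh implementation (the mesh_size value and the commented mtf call are illustrative assumptions, not code from the thread):

```python
# Name each GPU explicitly instead of passing [""] * mesh_size.
mesh_size = 4  # assumed number of local GPUs
devices = ["gpu:%d" % i for i in range(mesh_size)]

# The list is then passed where the mesh implementation is built,
# e.g. (requires mesh_tensorflow to be installed):
#   import mesh_tensorflow as mtf
#   mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
#       mesh_shape, layout_rules, devices)
```

With empty device names, TensorFlow falls back to default placement, which is why only one device ends up doing the work.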

assij commented 4 years ago

@PSZehnder Does Mesh TensorFlow support multi-node training (i.e. each node has #x GPUs attached to it)? I'm using 2 nodes, each with 8 GPUs, and would like to train on all (2 nodes * 8 GPUs) = 16 GPUs. How do I configure Mesh TensorFlow to train in a multi-node setup?

assij commented 4 years ago

@nshazeer Does Mesh TensorFlow support multi-node training (i.e. each node has #x GPUs attached to it)? I'm using 2 nodes, each with 8 GPUs, and would like to train on all (2 nodes * 8 GPUs) = 16 GPUs. How do I configure Mesh TensorFlow to train in a multi-node setup?

nshazeer commented 4 years ago

Yes, that should be possible, though I haven't done it. The GPU code just relies on device placement, so if you can construct a TF graph that can name all 16 GPUs as different devices, it should work...
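Following that suggestion, one way to name all 16 GPUs as distinct devices is with fully qualified TF device strings; a sketch assuming a two-task /job:worker cluster (the job name and task layout are assumptions, not something confirmed in the thread):

```python
# Build fully qualified device names for a 2-node x 8-GPU setup.
num_nodes, gpus_per_node = 2, 8
devices = [
    "/job:worker/task:%d/device:GPU:%d" % (task, gpu)
    for task in range(num_nodes)
    for gpu in range(gpus_per_node)
]
# 16 distinct names, "/job:worker/task:0/device:GPU:0" through
# "/job:worker/task:1/device:GPU:7", usable as the devices list
# for the placement mesh implementation.
```

This only covers naming the devices; the cluster itself still has to be set up (e.g. via TF_CONFIG or a ClusterSpec) so that both tasks are reachable.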

assij commented 4 years ago

@nshazeer, thanks for your reply. If I can make the 16 GPUs visible, how will data loading be done in a 2-node * 8-GPU setup? Will the data be loaded through one CPU on node0 (where I run the script, so one CPU sends data to all 16 GPUs), or will data loading be done from the two CPUs (node0 and node1), so that each CPU sends the data relevant to the 8 GPUs it is connected to?

zaccharieramzi commented 3 years ago

@nict-wisdom do you have a snippet showing how you used the ProfilerHook? I am struggling a bit with it at the moment.

weberxie commented 3 years ago

Met the same problem. Can anyone on this team reply to this issue?

Conformist101 commented 3 years ago

We are also facing the same issue. Any help in this context would be highly appreciated.