nouhadziri / THRED

The implementation of the paper "Augmenting Neural Response Generation with Context-Aware Topical Attention"
https://arxiv.org/abs/1811.01063
MIT License

GPU-Util is low when using multiple GPUs #26

Open LTlitong opened 4 years ago

LTlitong commented 4 years ago

Hello,

I want to train on multiple GPUs, and I have tried 8, 4, and 2 GPUs. However, the GPU-Util of some GPUs is very low, almost 0%. An epoch of training on 8 GPUs takes almost 20 minutes longer than on a single GPU.

Your code sets the default number of GPUs to 4, but when I use 4 cards, one card's GPU-Util is always 0%. With 2 cards no GPU sits at 0%, but one of them is still only around 20%. This is the GPU usage when training on 4 cards: [gpu-util screenshot]

I am not very clear about how the sharding works. Do I need to modify the code to train on multiple GPUs and speed up training?

Looking forward to your reply!

ehsk commented 4 years ago

You mentioned you ran the code with 1 or 2 GPUs. Did you have this problem in those runs too? I suggest turning on log_device in the config file and comparing the single-GPU run with the 4/8-GPU runs.
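For reference, here is a minimal sketch of what device-placement logging looks like in TensorFlow 1.x, assuming the repo's log_device flag maps onto tf.ConfigProto(log_device_placement=True); the exact wiring inside THRED may differ:

```python
# Minimal TF 1.x sketch of device-placement logging (assumption: THRED's
# `log_device` config flag sets log_device_placement on the session config).
import tensorflow as tf

config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0], name="a")
    b = tf.constant([3.0, 4.0], name="b")
    print(sess.run(a + b))
# The session log lists the device (e.g. /device:GPU:0) chosen for every op;
# a GPU that never shows up in the log is receiving no work.
```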

I haven't had this problem before, although GPU-util was around 50-60% for all GPUs.

LTlitong commented 4 years ago

Thanks for your reply!

  1. The GPU-util was 70-80% when running with 1 GPU, and 50% and 20% respectively when running with 2 GPUs. But there is always one GPU whose utilization stays at 0% the whole time. I turned on log_device to get the device mapping and have sent it to you by email.

  2. Moreover, I also want to ask whether the experimental results in the paper are averaged over the 3 datasets (3/4/5-turn Reddit)? I ran all epochs, but my results differ from the paper's. Could you please provide your results on each dataset?

ehsk commented 4 years ago

Sorry for the late reply.

  1. Have you set CUDA_VISIBLE_DEVICES? (A sketch of how it works follows after this list.) Based on the log you sent, no tensor was assigned to one of the GPUs.

  2. All the results in the paper are reported based on the 3-turn dataset.
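For anyone hitting the same issue, this is roughly how CUDA_VISIBLE_DEVICES restricts which GPUs TensorFlow can see; the script name below is a placeholder, not THRED's actual entry point:

```python
# Sketch of restricting visible GPUs with CUDA_VISIBLE_DEVICES.
# It must be set before TensorFlow initializes CUDA, either in the shell:
#   CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py   # train.py is a placeholder name
# or at the very top of the training script:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # expose 4 GPUs, remapped to indices 0-3

import tensorflow as tf  # import only after the variable is set

# Any GPU not listed above is invisible to TensorFlow, so ops cannot be
# placed on devices you did not intend to use.
print(tf.test.gpu_device_name())  # e.g. '/device:GPU:0' when a GPU is visible
```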