tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Multi-GPU gives no speedup for transformer model #1110

Open stefan-falk opened 6 years ago

stefan-falk commented 6 years ago

Description

I am training a Transformer model on the Librispeech dataset using 4 GPUs with 8 CPU-cores.

I have tested the following:

Single-GPU

export CUDA_VISIBLE_DEVICES=0

t2t-trainer \
  --worker-gpu=1 \
  # ..

Multi-GPU

export CUDA_VISIBLE_DEVICES=0,1,2,3

t2t-trainer \
  --worker-gpu=4 \
  # ..

Both scripts work. The training starts and on the surface everything looks okay. However, I am getting a global_step/sec of only ~2 for Multi-GPU, compared to ~9 for Single-GPU.

Shouldn't I see a speedup when using multiple GPUs? If so, what might be the problem here? Can I trust the log output?
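For what it's worth, here is how I sanity-check that all four GPUs are actually busy during the multi-GPU run (this assumes nvidia-smi is available; it ships with the NVIDIA driver):

# In a second shell, poll GPU utilization once per second
# while t2t-trainer is running.
watch -n 1 nvidia-smi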


Environment information

OS: Linux #37~16.04.1-Ubuntu SMP Tue Aug 28 10:44:06 UTC 2018 GNU/Linux

$ pip freeze | grep tensor
tensor2tensor==1.9.0
tensorboard==1.10.0
tensorflow-gpu==1.10.1

$ python -V
Python 3.5.6 :: Anaconda, Inc.
martinpopel commented 6 years ago

This is expected: T2T uses synchronous training, so one step with 4 GPUs trains on a 4-times larger effective batch size. See e.g. this paper.

stefan-falk commented 6 years ago

@martinpopel Thank you for the link. If I understand correctly, I have to consider the effective throughput, i.e. samples per second, is that right?

If I take my example from above this would mean

Single: 9 gstep/s * 16 bsize          = 144 samples/s
Multi:  2 gstep/s * 16 bsize * 4 #gpu = 128 samples/s

It would seem that, for this dataset, the additional overhead effectively slows down the training. Am I getting something wrong?

martinpopel commented 6 years ago

Yes, it seems the training throughput with 4 GPUs is lower than with a single GPU in your case. That is strange; usually it is higher (a sublinear speedup). Maybe the interconnect between your GPUs is slow (e.g. no NVLink). I am also not familiar with Librispeech. In any case, the larger effective batch size may still lead to faster convergence on the dev set, as explained in the paper I linked.
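If you want to rule out a slow interconnect, one quick check (again assuming nvidia-smi is available) is to print the GPU topology matrix:

# Shows how each GPU pair is connected
# (NVLink, PCIe host bridge, crossing a CPU socket, etc.).
nvidia-smi topo -m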

stefan-falk commented 6 years ago

@martinpopel In the meantime I ran another small experiment: starting from scratch, I reached a loss of 1.7 after 2 minutes with one GPU, while with 4 GPUs it took 9 minutes to get there. That's not a very scientific test, but it seems multi-GPU does not help me here, for whatever reason. I'll look into the interconnect topic just to be sure.

lkluo commented 5 years ago

Any update on using multiple GPUs?

stefan-falk commented 5 years ago

@lkluo What I can say, based on my observations, is that 4 GPUs let the model converge faster and to an overall better result. I guess this is due to the larger effective batch size. From my experience, I'd recommend using multiple GPUs together with larger batch sizes.
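As a rough sketch of what that looks like (batch_size is passed as an hparams override; the value 2048 is only a placeholder and depends on your GPU memory and problem):

export CUDA_VISIBLE_DEVICES=0,1,2,3

t2t-trainer \
  --worker-gpu=4 \
  --hparams='batch_size=2048' \
  # ..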

isaprykin commented 4 years ago

> @lkluo What I can say, based on my observations, is that 4 GPUs let the model converge faster and to an overall better result. I guess this is due to the larger effective batch size. From my experience, I'd recommend using multiple GPUs together with larger batch sizes.

How long did it take you to reach SOTA on 4 GPUs?

shizhediao commented 2 years ago

Same problem here.