tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial

Multiple GPU makes training slower. #309

daisylab opened this issue 6 years ago (status: Open)

daisylab commented 6 years ago

Hi, I recently began studying neural machine translation (seq2seq). Currently I'm working through the TensorFlow tutorial:

https://www.tensorflow.org/tutorials/seq2seq

I ran the sample code from the site, i.e., the first NMT model, which translates from Vietnamese to English. My machine has two NVIDIA 1080 Ti GPUs, so I expected some speed-up from using both.

However, to my surprise, using two GPUs is actually slower than using one.

Here's the command I used and the result.

python -m nmt.nmt \
    --src=vi --tgt=en \
    --vocab_prefix=/tmp/nmt_data/vocab  \
    --train_prefix=/tmp/nmt_data/train \
    --dev_prefix=/tmp/nmt_data/tst2012  \
    --test_prefix=/tmp/nmt_data/tst2013 \
    --out_dir=/tmp/nmt_model \
    --num_train_steps=12000 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=128 \
    --dropout=0.2 \
    --metrics=bleu

And I could see the step-time and wps as follows:

# Start step 0, lr 1, Tue Apr 17 21:24:40 2018
# Init train iterator, skipping 0 elements
  step 100 lr 1 step-time 0.11s wps 51.53K ppl 1733.09 gN 14.20 bleu 0.00, Tue Apr 17 21:24:51 2018
  step 200 lr 1 step-time 0.07s wps 83.14K ppl 545.74 gN 6.52 bleu 0.00, Tue Apr 17 21:24:58 2018
  step 300 lr 1 step-time 0.07s wps 83.20K ppl 348.48 gN 4.86 bleu 0.00, Tue Apr 17 21:25:05 2018
  step 400 lr 1 step-time 0.07s wps 83.55K ppl 255.14 gN 4.02 bleu 0.00, Tue Apr 17 21:25:11 2018
  step 500 lr 1 step-time 0.07s wps 82.43K ppl 221.13 gN 3.90 bleu 0.00, Tue Apr 17 21:25:18 2018

And when I used two GPUs:

python -m nmt.nmt \
    --src=vi --tgt=en \
    --vocab_prefix=/tmp/nmt_data/vocab  \
    --train_prefix=/tmp/nmt_data/train \
    --dev_prefix=/tmp/nmt_data/tst2012  \
    --test_prefix=/tmp/nmt_data/tst2013 \
    --out_dir=/tmp/nmt_model \
    --num_train_steps=12000 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=128 \
    --dropout=0.2 \
    --metrics=bleu \
    --num_gpus=2

I got only:

# Start step 0, lr 1, Tue Apr 17 21:27:03 2018
# Init train iterator, skipping 0 elements
  step 100 lr 1 step-time 0.39s wps 14.50K ppl 1960.62 gN 16.89 bleu 0.00, Tue Apr 17 21:27:42 2018
  step 200 lr 1 step-time 0.35s wps 15.94K ppl 538.04 gN 5.75 bleu 0.00, Tue Apr 17 21:28:17 2018
  step 300 lr 1 step-time 0.35s wps 15.93K ppl 371.06 gN 5.27 bleu 0.00, Tue Apr 17 21:28:52 2018
  step 400 lr 1 step-time 0.35s wps 15.94K ppl 271.32 gN 4.45 bleu 0.00, Tue Apr 17 21:29:27 2018
  step 500 lr 1 step-time 0.36s wps 15.99K ppl 221.52 gN 3.69 bleu 0.00, Tue Apr 17 21:30:03 2018

So my questions are:

  1. Is this normal behavior? I suspect there may be a problem with the parallelism, or
  2. Am I missing something important?

I look forward to hearing from you.

Best regards, sungjin

tslater commented 5 years ago

@daisylab, did you ever figure out your issue, or is it still slow?

mohammedayub44 commented 5 years ago

@daisylab @tslater I have a somewhat similar issue. Are there any GPU-side optimizations that can speed this up? I'm using 4 Tesla V100 SXM2 16 GB GPUs, and the training step time and utilization are as below:

Step time: 0.54s [screenshot]

Under-utilized GPUs: [screenshot]

Looping in some folks here: @bastings @oahziur

I'd appreciate any help. Thanks.
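One quick check before tuning anything is whether the ops are actually spread across the GPUs. Below is a tiny standalone TF 1.x sketch (not one of the commands above; the explicit device strings and matrix sizes are just for illustration) that turns on device-placement logging. If I remember correctly, the nmt scripts also expose a --log_device_placement flag for the same purpose.

    import tensorflow as tf  # TF 1.x

    # Print which device each op is placed on; soft placement avoids errors
    # if a requested GPU is not available.
    config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)

    with tf.device("/gpu:0"):
        a = tf.random_normal([2048, 2048])
    with tf.device("/gpu:1"):
        b = tf.random_normal([2048, 2048])
    c = tf.matmul(a, b)  # TensorFlow places this op and copies inputs as needed

    with tf.Session(config=config) as sess:
        sess.run(c)  # placement decisions are written to the log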

yzchen commented 5 years ago

I think there are a couple of reasons for this issue:

  1. Maybe you should set batch_size (the --batch_size flag) as large as your GPU memory allows, to better utilize each GPU.

  2. When you specify num_gpus, the nmt model is split so that different parts are placed on different GPUs. That is workload (model) partitioning, not data parallelism, so the extra cross-device communication can make each step slower. You could write a data-parallel training version instead of relying on the num_gpus flag; a sketch of that approach follows below.
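To make the second point concrete, here is a minimal data-parallel sketch in TF 1.x style (the era of this codebase). It is not the nmt implementation: the model function, tensor shapes, and learning rate below are stand-ins for illustration. Each GPU gets a slice of the batch, gradients are computed per tower against shared weights, averaged, and applied as one update.

    import tensorflow as tf  # TF 1.x, matching this codebase

    NUM_GPUS = 2  # assumption: two visible GPUs

    def tower_loss(features, labels):
        """Hypothetical stand-in for the real seq2seq graph."""
        logits = tf.layers.dense(features, 10, name="proj")
        return tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

    features = tf.placeholder(tf.float32, [None, 128], name="features")
    labels = tf.placeholder(tf.int32, [None], name="labels")

    # Shard the batch across GPUs (data parallelism).
    feature_shards = tf.split(features, NUM_GPUS, axis=0)
    label_shards = tf.split(labels, NUM_GPUS, axis=0)

    optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0)

    tower_grads = []
    for i in range(NUM_GPUS):
        # reuse=True after the first tower so all towers share one set of weights
        with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
            loss = tower_loss(feature_shards[i], label_shards[i])
            tower_grads.append(optimizer.compute_gradients(loss))

    # Average the per-tower gradients and apply a single update.
    averaged_grads = []
    for grads_and_vars in zip(*tower_grads):
        grads = tf.stack([g for g, _ in grads_and_vars], axis=0)
        averaged_grads.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))
    train_op = optimizer.apply_gradients(averaged_grads)

Later TensorFlow releases wrap this pattern in tf.distribute.MirroredStrategy, but the idea is the same: replicate the model, shard each batch, and combine gradients, instead of splitting one model across devices the way --num_gpus does.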