tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial

Insights on model parallel with LSTMs on separate GPUs #272

Open willzyz opened 6 years ago

willzyz commented 6 years ago

Hi NMT authors,

I have a question about model parallelism obtained by placing LSTM layers on separate GPUs. I tested 1 GPU vs. 4 GPUs with 4 LSTM layers in a seq2seq model (PCIe for GPU/RAM communication):

Step-time, model parallel (4 GPUs): 0.14s
Step-time, single GPU: 0.10s

Multi-GPU is actually 40% slower, most likely because short sequences leave little room for parallelization while CPU-GPU communication adds overhead. The setup is a plain seq2seq model (no vector / learning-to-rank objective), minibatches of 128 sequences, 4 layers, 512-dimensional LSTMs, and query-keyword data (~4-5 words per sequence on average).

I'd like to confirm the rationale in these papers:
https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
https://arxiv.org/pdf/1609.08144.pdf

A seq2seq model can be model-parallelized as follows:

Consider a 3-layer LSTM encoder on a sequence of length 4, with the cells numbered left to right:

1 -> 2 -> 3 -> 4 (gpu2, layer 3)
5 -> 6 -> 7 -> 8 (gpu1, layer 2)
9 -> 10 -> 11 -> 12 (gpu0, layer 1)
w1, w2, w3, w4 (input tokens)
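For concreteness, a layout like this can be expressed in the TF 1.x API roughly as follows (a minimal sketch, not the actual nmt code; tf.contrib.rnn.DeviceWrapper pins each layer's ops to one GPU, and num_layers/num_units here are just the values from this example):

```python
import tensorflow as tf  # TF 1.x API

num_layers = 3
num_units = 512

cells = []
for layer in range(num_layers):
    cell = tf.contrib.rnn.BasicLSTMCell(num_units)
    # Pin this layer to one GPU; activations cross GPUs (over PCIe here)
    # between adjacent layers at every time step.
    cell = tf.contrib.rnn.DeviceWrapper(cell, "/gpu:%d" % layer)
    cells.append(cell)
encoder_cell = tf.contrib.rnn.MultiRNNCell(cells)

# source_embeddings: [batch, time, embed_dim] placeholder for illustration.
source_embeddings = tf.placeholder(tf.float32, [None, None, num_units])
outputs, final_state = tf.nn.dynamic_rnn(
    encoder_cell, source_embeddings, dtype=tf.float32)
```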

With model parallelism, w1 -> cell 9 is processed on gpu0; then cells 10 and 5 run concurrently on gpu0 and gpu1; then cells 11, 6, and 1 run concurrently on gpu0, gpu1, and gpu2. So the speed-up w.r.t. the number of GPUs is non-linear: it approaches linear only when the sequence length is large, and is only marginal for short sequences.
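To put rough numbers on this (my own back-of-the-envelope simplification, ignoring communication cost entirely): pipelining L layer-stages over a sequence of T steps takes T + L - 1 stage-times instead of L * T, so the ideal speedup is L*T / (T + L - 1):

```python
def ideal_pipeline_speedup(num_stages, seq_len):
    """Ideal speedup of layer-per-GPU pipelining, ignoring comms cost."""
    serial_time = num_stages * seq_len          # one GPU does all layers
    pipelined_time = seq_len + num_stages - 1   # fill + drain the pipeline
    return serial_time / float(pipelined_time)

# Short query-keyword sequences (~4-5 tokens) barely fill a 4-stage pipeline:
print(ideal_pipeline_speedup(4, 5))    # ~2.5x at best, before PCIe overhead
# Long sequences approach the linear limit:
print(ideal_pipeline_speedup(4, 100))  # ~3.9x
```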

The communication overhead is also high: the tensors exchanged across GPUs, summed over the many layers and the longer sequences, are on the order of 10 MB.
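A rough sanity check on that figure (again my own back-of-the-envelope numbers: float32 activations, 3 cross-GPU boundaries for 4 layers on 4 GPUs, forward pass of the encoder only; decoder steps and the backward pass add more on top):

```python
batch_size = 128
num_units = 512
bytes_per_float = 4
seq_len = 5            # ~4-5 tokens per sequence
num_boundaries = 3     # 4 layers on 4 GPUs -> 3 cross-GPU hops

# One activation tensor crossing a GPU boundary at one time step:
per_tensor_mb = batch_size * num_units * bytes_per_float / 1e6

# Encoder forward pass alone; decoder and backward pass push this
# further into the multi-MB range quoted above.
forward_mb = per_tensor_mb * num_boundaries * seq_len
print(per_tensor_mb, forward_mb)   # ~0.26 MB per tensor, ~3.9 MB total
```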

It'd be nice to confirm the above analysis.

Thanks very much.

mohammedayub44 commented 5 years ago

@willzyz I'm seeing a similar trend when trying CPU, 1 GPU, and 4 GPUs on an 8-layer architecture. Do you happen to know why multi-GPU has a higher per-step time than a single GPU, or are there any other optimizations you have tried?

Appreciate any help

-Mohammed Ayub