fanshiqing opened this issue 4 years ago
A detailed breakdown of memory allocation from tfprof (with BATCH_SIZE = 32, VOCAB_SIZE = 32000, EMBEDDING_DIM = 2048, MAX_SEQUENCE_LENGTH = 1024) is listed below:
node name | requested bytes | total execution time | accelerator execution time | cpu execution time
_TFProfRoot (--/17327.34MB, --/6.22sec, --/1.36sec, --/4.86sec)
  1bwds_wpm_level_lm/ (0B/725.85MB, 0us/2.76ms, 0us/0us, 0us/2.76ms)
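For reference, a minimal sketch of how this kind of per-scope time-and-memory breakdown can be generated with the TF 1.x profiler; `sess` and `train_op` here are assumed placeholders for the actual session and training op, not taken from the codebase:

```python
# Sketch only (TF 1.x): `sess` and `train_op` stand in for the actual
# session and training op used in the experiment.
import tensorflow as tf

run_meta = tf.RunMetadata()
sess.run(train_op,
         options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
         run_metadata=run_meta)

# Per-scope requested bytes and execution times, similar to the dump above.
tf.profiler.profile(
    sess.graph,
    run_meta=run_meta,
    cmd='scope',
    options=tf.profiler.ProfileOptionBuilder.time_and_memory())
```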
From the breakdown above, the forward softmax layer output (the logits tensor, namely fprop/1bwds_wpm_level_lm/tower_0_0/transpose_4) is 4.19 GB, and its corresponding backward gradient (gradients/fprop/1bwds_wpm_level_lm/tower_0_0/transpose_4_grad/transpose) has the same size, so these two tensors alone consume at least 8.4 GB. However, the GPipe paper claims a Peak Activation Memory of only 6.4 GB for Pipeline-1 with re-computation enabled, which seems inconsistent with what is measured here.
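As a sanity check, the 4.19 GB figure matches a float32 logits tensor of shape [BATCH_SIZE, MAX_SEQUENCE_LENGTH, VOCAB_SIZE] under the configuration above; this is a back-of-the-envelope calculation, not a number taken from the codebase:

```python
# Back-of-the-envelope size of the logits tensor (float32 = 4 bytes).
BATCH_SIZE = 32
MAX_SEQUENCE_LENGTH = 1024
VOCAB_SIZE = 32000

logits_bytes = BATCH_SIZE * MAX_SEQUENCE_LENGTH * VOCAB_SIZE * 4
print(logits_bytes / 1e9)       # ~4.19 GB for the forward logits
print(2 * logits_bytes / 1e9)   # ~8.39 GB including the matching gradient
```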
Codebase Version:
Hardware config:
Problem Description
For the default hyper-parameters as described here, using these configurations directly causes OOM on a V100-16GB.
The key model hyper-params are listed as follows; they are aligned with the configs in Table 1 of the GPipe paper:
The running script is as follows.
In my experiments with this config, the maximum number of transformer layers per GPU before OOM is only 1, with all other model hyper-configurations kept unchanged (whereas the codebase describes 8 transformer layers per GPU), which is confusing.

I then tuned BATCH_SIZE from 1 to 32 and broke down the actual total peak GPU memory consumption as follows:

In summary, based on the measured data listed above, it seems that: (1) the re-computation feature for forward layers does not work as expected when it is needed; (2) even with (1) disabled, the native memory consumption of this transformer is much larger than that reported in the codebase (dropping from 8 to 1 transformer_layer_per_gpu before OOM).
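For anyone reproducing the batch-size sweep, a minimal sketch of how peak GPU memory can be read back per run, assuming TF 1.x; `build_train_op(batch_size)` is a hypothetical helper standing in for the actual model and training setup:

```python
# Sketch only (TF 1.x): build_train_op() is a hypothetical stand-in for
# constructing the model's training op at a given batch size.
import tensorflow as tf

for batch_size in (1, 2, 4, 8, 16, 32):
    tf.reset_default_graph()
    train_op = build_train_op(batch_size)  # hypothetical helper
    with tf.device('/gpu:0'):
        # Peak bytes allocated on this GPU's allocator so far.
        peak_bytes = tf.contrib.memory_stats.MaxBytesInUse()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_op)
        print(batch_size, sess.run(peak_bytes) / 1e9, 'GB peak')
```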
Thanks in advance for any help! @bignamehyp