tensorflow / lingvo


Abnormal memory allocation of transformer with GPipe on V100-16GB #205

Open fanshiqing opened 4 years ago

fanshiqing commented 4 years ago

Codebase Version:

Hardware config:

Problem Description: Using the default hyper-parameters described here directly causes OOM on a V100-16GB.

The key model hyper-parameters are listed below; they are aligned with the configs in Table 1 of the GPipe paper:

BATCH_SIZE = 32
VOCAB_SIZE = 32000
EMBEDDING_DIM = 2048
MAX_SEQUENCE_LENGTH = 1024
Optimizer: RMSProp
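
For scale, here is a rough float32 back-of-envelope of the per-tensor sizes these hyper-parameters imply (a sketch only; real allocations also include parameters, gradients, optimizer slots, and allocator overhead):

# Rough float32 sizes implied by the hyper-parameters above (4 bytes/element).
BATCH_SIZE, MAX_SEQUENCE_LENGTH = 32, 1024
VOCAB_SIZE, EMBEDDING_DIM = 32000, 2048

embedding_table = VOCAB_SIZE * EMBEDDING_DIM * 4                              # ~0.26 GB
per_layer_activation = BATCH_SIZE * MAX_SEQUENCE_LENGTH * EMBEDDING_DIM * 4   # ~0.27 GB per tensor

print(f"embedding table:      {embedding_table / 1e9:.2f} GB")
print(f"per-layer activation: {per_layer_activation / 1e9:.2f} GB")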

The running script is as follows.

num_gpu=1
./bazel-bin/lingvo/trainer \
  --run_locally=gpu --mode=sync \
  --model=lm.one_billion_wds.OneBWdsGPipeTransformerWPM \
  --logdir=/tmp/lm/log/ \
  --logtostderr \
  --worker_split_size=${num_gpu} \
  --worker_gpus=${num_gpu}

In fact, experiments with my config show that the maximum number of transformer layers per GPU before OOM is only 1, with all other model hyper-parameters unchanged (whereas the codebase describes 8 transformer layers per GPU), which is confusing.

I then swept BATCH_SIZE from 1 to 32 and broke down the actual total peak GPU memory consumption as follows:

[figures: measured peak GPU memory breakdown vs. BATCH_SIZE]
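
For reference, a minimal TF 1.x sketch of one way to read peak GPU memory in-graph, using tf.contrib.memory_stats with a toy workload standing in for a real training step (this is not necessarily how the numbers above were collected):

import tensorflow as tf  # TF 1.x graph mode

with tf.device('/gpu:0'):
  # Toy workload standing in for one training step of the real model.
  x = tf.random_normal([32, 1024, 2048])
  step = tf.reduce_sum(tf.matmul(x, x, transpose_b=True))
  # Reports the peak bytes ever allocated on the device this op is placed on.
  peak_bytes = tf.contrib.memory_stats.MaxBytesInUse()

with tf.Session() as sess:
  sess.run(step)
  print(sess.run(peak_bytes) / 1e9, "GB peak on GPU 0")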

In summary, based on the measured data above, it seems that: (1) the re-computation of forward-layer activations during bprop does not appear to work as expected; (2) even leaving (1) aside, the baseline memory consumption of this transformer is much larger than what the codebase reports (the number of transformer layers per GPU drops from 8 to 1 before OOM).
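
Regarding (1), my understanding of re-computation is the standard rematerialization pattern sketched below. This is a generic TF 1.x illustration of the idea, not Lingvo's actual GPipe implementation, just to clarify what I would expect to happen:

import tensorflow as tf  # TF 1.x graph mode

def recompute(fn):
  """Wraps `fn` so its intermediate activations are recomputed during bprop."""
  @tf.custom_gradient
  def wrapped(x):
    y = fn(x)
    def grad(dy):
      # Re-run the forward pass at gradient time; only x and y need to stay
      # resident between fprop and bprop, not fn's intermediates.
      y_recomputed = fn(x)
      return tf.gradients(y_recomputed, x, grad_ys=dy)[0]
    return y, grad
  return wrapped

# Variable-free stand-in for one activation-heavy transformer sub-block.
def block(x):
  return tf.nn.relu(tf.matmul(x, x, transpose_b=True))

x = tf.random_normal([32, 1024, 2048])
loss = tf.reduce_sum(recompute(block)(x))
dx = tf.gradients(loss, x)[0]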

Thanks in advance for any help! @bignamehyp

fanshiqing commented 4 years ago

A detailed breakdown of the memory allocation from tfprof is as follows:

BATCH_SIZE = 32, VOCAB_SIZE = 32000, EMBEDDING_DIM = 2048, MAX_SEQUENCE_LENGTH = 1024

node name | requested bytes | total execution time | accelerator execution time | cpu execution time
_TFProfRoot (--/17327.34MB, --/6.22sec, --/1.36sec, --/4.86sec)
  1bwds_wpm_level_lm/ (0B/725.85MB, 0us/2.76ms, 0us/0us, 0us/2.76ms)
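
(The breakdown above comes from tfprof; a minimal TF 1.x sketch of roughly the kind of call used, with a toy graph standing in for the real trainer step:)

import tensorflow as tf  # TF 1.x

# Toy graph standing in for one training step of the real model.
x = tf.random_normal([32, 1024, 2048])
step = tf.reduce_sum(tf.nn.relu(x))

run_meta = tf.RunMetadata()
with tf.Session() as sess:
  sess.run(step,
           options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
           run_metadata=run_meta)
  # cmd='scope' groups requested bytes / execution time by name scope,
  # which yields the node-name tree shown above.
  opts = tf.profiler.ProfileOptionBuilder.time_and_memory()
  tf.profiler.profile(sess.graph, run_meta=run_meta, cmd='scope', options=opts)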

From the above analysis, the forward softmax layer output (logits), namely fprop/1bwds_wpm_level_lm/tower_0_0/transpose_4, is 4.19 GB, and its corresponding backward gradient (gradients/fprop/1bwds_wpm_level_lm/tower_0_0/transpose_4_grad/transpose) has the same size; these two tensors alone consume at least 8.4 GB. Yet the GPipe paper claims a peak activation memory of only 6.4 GB for Pipeline-1 with re-computation enabled, so something here seems unreasonable.
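
The 4.19 GB figure is consistent with a plain float32 tensor of BATCH_SIZE * MAX_SEQUENCE_LENGTH * VOCAB_SIZE elements:

# float32 logits: BATCH_SIZE * MAX_SEQUENCE_LENGTH * VOCAB_SIZE elements * 4 bytes
logits_bytes = 32 * 1024 * 32000 * 4
print(logits_bytes / 1e9)       # ~4.19 GB (transpose_4)
print(2 * logits_bytes / 1e9)   # ~8.39 GB including the matching gradient tensor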

[screenshot: detailed tfprof memory breakdown]