fanshiqing opened this issue 4 years ago
A detailed breakdown of memory allocation from tfprof (with BATCH_SIZE = 32, VOCAB_SIZE = 32000, EMBEDDING_DIM = 2048, MAX_SEQUENCE_LENGTH = 1024) is listed below:
node name | requested bytes | total execution time | accelerator execution time | cpu execution time
_TFProfRoot (--/17327.34MB, --/6.22sec, --/1.36sec, --/4.86sec)
  1bwds_wpm_level_lm/ (0B/725.85MB, 0us/2.76ms, 0us/0us, 0us/2.76ms)
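For reference, a minimal sketch of how this kind of per-scope time-and-memory breakdown can be generated with the TF 1.x profiler; `sess` and `train_op` here are assumed placeholders for the actual session and training op, not taken from the codebase:

```python
# Sketch only (TF 1.x): `sess` and `train_op` stand in for the actual
# session and training op used in the experiment.
import tensorflow as tf

run_meta = tf.RunMetadata()
sess.run(train_op,
         options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
         run_metadata=run_meta)

# Per-scope requested bytes and execution times, similar to the dump above.
tf.profiler.profile(
    sess.graph,
    run_meta=run_meta,
    cmd='scope',
    options=tf.profiler.ProfileOptionBuilder.time_and_memory())
```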
From the breakdown above, the forward softmax layer output (the logits tensor, namely fprop/1bwds_wpm_level_lm/tower_0_0/transpose_4) is 4.19 GB, and its corresponding backward gradient (gradients/fprop/1bwds_wpm_level_lm/tower_0_0/transpose_4_grad/transpose) has the same size, so these two tensors alone consume at least 8.4 GB. However, the GPipe paper claims a Peak Activation Memory of only 6.4 GB for Pipeline-1 with re-computation enabled, which seems inconsistent with what is measured here.
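As a sanity check, the 4.19 GB figure matches a float32 logits tensor of shape [BATCH_SIZE, MAX_SEQUENCE_LENGTH, VOCAB_SIZE] under the configuration above; this is a back-of-the-envelope calculation, not a number taken from the codebase:

```python
# Back-of-the-envelope size of the logits tensor (float32 = 4 bytes).
BATCH_SIZE = 32
MAX_SEQUENCE_LENGTH = 1024
VOCAB_SIZE = 32000

logits_bytes = BATCH_SIZE * MAX_SEQUENCE_LENGTH * VOCAB_SIZE * 4
print(logits_bytes / 1e9)       # ~4.19 GB for the forward logits
print(2 * logits_bytes / 1e9)   # ~8.39 GB including the matching gradient
```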
Codebase Version:
Hardware config:
Problem Description
For the default hyper-parameters as described here, using these configurations directly causes OOM on a V100-16GB.
The key model hyper-params are listed as follows; they are aligned with the configs in Table 1 of the GPipe paper:
The running script is as follows.
In my experiments with this config, the maximum number of transformer layers per GPU before OOM is only 1, with all other model hyper-configurations kept unchanged (whereas the codebase describes 8 transformer layers per GPU), which is confusing.

I then tuned BATCH_SIZE from 1 to 32 and broke down the actual total peak GPU memory consumption as follows:

In summary, based on the measured data listed above, it seems that: (1) the re-computation feature for forward layers does not work as expected when it is needed; (2) even with (1) disabled, the native memory consumption of this transformer is much larger than that reported in the codebase (dropping from 8 to 1 transformer_layer_per_gpu before OOM).
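For anyone reproducing the batch-size sweep, a minimal sketch of how peak GPU memory can be read back per run, assuming TF 1.x; `build_train_op(batch_size)` is a hypothetical helper standing in for the actual model and training setup:

```python
# Sketch only (TF 1.x): build_train_op() is a hypothetical stand-in for
# constructing the model's training op at a given batch size.
import tensorflow as tf

for batch_size in (1, 2, 4, 8, 16, 32):
    tf.reset_default_graph()
    train_op = build_train_op(batch_size)  # hypothetical helper
    with tf.device('/gpu:0'):
        # Peak bytes allocated on this GPU's allocator so far.
        peak_bytes = tf.contrib.memory_stats.MaxBytesInUse()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_op)
        print(batch_size, sess.run(peak_bytes) / 1e9, 'GB peak')
```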
Thanks in advance for any help! @bignamehyp