microsoft / DeepSpeed


[BUG] a huge memory leak when using `register_full_backward_hook` #1572

Open stas00 opened 2 years ago

stas00 commented 2 years ago

Describe the bug

When trying to use register_full_backward_hook in Megatron-DeepSpeed, I get a huge memory leak.

I'm reporting it here since the leak is gone as soon as I turn deepspeed off.

To Reproduce

I tried to create a small standalone example that uses deepspeed directly, but I couldn't make it leak.
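For reference, this is roughly the kind of standalone setup I mean (the toy model, the inline ds_config and the training loop are just an illustration, not the exact script I tried); this sort of thing does not leak for me:

# sketch of a standalone repro attempt -- launch with e.g.: deepspeed --num_gpus 1 repro.py
import torch
import deepspeed

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,
    "zero_optimization": {"stage": 1},
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

def backward_hook(module, grad_input, grad_output):
    pass  # no-op, same as in the Megatron-DeepSpeed patch below

# register the no-op hook on every submodule of the wrapped model
engine.module.apply(lambda m: m.register_full_backward_hook(backward_hook))

for step in range(10):
    x = torch.randn(1, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)
    engine.step()
    # here the allocated memory stays flat, unlike in Megatron-DeepSpeed
    print(step, torch.cuda.memory_allocated())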

So let's work with Megatron-DeepSpeed. We can use either the bigscience version or your original one; it leaks in both (since the problem is triggered by deepspeed).

git clone https://github.com/microsoft/Megatron-DeepSpeed
cd Megatron-DeepSpeed

Now apply this patch:

diff --git a/megatron/mpu/cross_entropy.py b/megatron/mpu/cross_entropy.py
index 8c790cd..a0b40b1 100644
--- a/megatron/mpu/cross_entropy.py
+++ b/megatron/mpu/cross_entropy.py
@@ -107,4 +107,4 @@ class _VocabParallelCrossEntropy(torch.autograd.Function):

 def vocab_parallel_cross_entropy(vocab_parallel_logits, target):
     """Helper function for the cross entropy."""
-    return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target)
+    return _VocabParallelCrossEntropy.apply(vocab_parallel_logits.clone(), target)
diff --git a/megatron/training.py b/megatron/training.py
index e3a168c..9389029 100644
--- a/megatron/training.py
+++ b/megatron/training.py
@@ -692,6 +692,13 @@ def train(forward_step_func, model, optimizer, lr_scheduler,
     # Write args to tensorboard
     write_args_to_tensorboard()

+    def backward_hook(module, input, output): pass
+    def _register_backward_hook(module):
+        module.register_full_backward_hook(backward_hook)
+        #module.register_backward_hook(backward_hook)
+    model[0].apply(_register_backward_hook)
+
+
     # Turn on training mode which enables dropout.
     for model_module in model:
         model_module.train()

The cross_entropy change addresses a separate issue in Megatron-LM; it's unrelated to this report, but it's required to be able to use backward hooks at all.

As you can see, I'm only adding a no-op backward hook, a very trivial change.

If I use the new register_full_backward_hook, I get a huge leak when running train. If I use the deprecated register_backward_hook, all is good.

If I turn off deepspeed the leak goes away as well.

I experimented with removing various config options and disabling ZeRO-1; it didn't make a difference, so the leak must come from somewhere in the engine.

While researching the cause of the leak I found this discussion: https://discuss.pytorch.org/t/register-full-backward-hook-causes-memory-leak/122904, which suggests that somewhere during backward a graph is created with a self-reference loop, so the tensors never get released.
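One simple way to watch the growth is to log allocated CUDA memory and the number of live CUDA tensors once per iteration; a minimal sketch of that kind of instrumentation (the helper name log_cuda_memory is arbitrary):

import gc
import torch

def log_cuda_memory(tag=""):
    """Log allocated CUDA memory and the count of live CUDA tensors.

    Called once per iteration, this makes the growth obvious: with
    register_full_backward_hook the numbers climb every step in my runs,
    with the deprecated register_backward_hook they stay flat.
    """
    n_cuda_tensors = sum(
        1 for obj in gc.get_objects()
        if torch.is_tensor(obj) and obj.is_cuda
    )
    print(f"{tag} allocated={torch.cuda.memory_allocated() / 2**20:.1f}MB "
          f"cuda_tensors={n_cuda_tensors}")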

Using the above patch you should be able to reproduce the leak within 10 iterations on a tiny model. I'm not sure how you test Megatron-DeepSpeed; you can, for example, use our test suite from https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tests/test_training.py.

Or you can use the script below, but you will need to create a bit of data and grab the vocab files from https://github.com/NVIDIA/Megatron-LM#downloading-checkpoints:

CHECKPOINT_PATH=checkpoints/gpt2

VOCAB_FILE=data/gpt2-vocab.json
MERGE_FILE=data/gpt2-merges.txt
#DATA_PATH=data/meg-gpt2_text_document
DATA_PATH=data/meg-gpt2_oscar-combined_text_document
TENSORBOARD_PATH=output_dir/tensorboard

N_GPUS=2
MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=16
TP_SIZE=2
PP_SIZE=1

SEQ_LEN=1024

SAVE_INTERVAL=50

#    --train-samples 10_000 \
#    --exit-interval $EXIT_INTERVAL \

GPT_ARGS=" \
    --num-layers 2 \
    --hidden-size 64 \
    --num-attention-heads 2 \
    --ffn-hidden-size 256 \
    --seq-length $SEQ_LEN \
    --max-position-embeddings $SEQ_LEN \
    --micro-batch-size $MICRO_BATCH_SIZE \
    --rampup-batch-size 2 2 1_000 \
    --global-batch-size $GLOBAL_BATCH_SIZE \
    --train-samples 100 \
    --optimizer adam \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --adam-eps 1e-8 \
    --lr 1e-4 \
    --lr-warmup-samples 5 \
    --clip-grad 1.0 \
    --weight-decay 1e-1 \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE \
    --fp16 \
    --partition-activations \
    --seed 42 \
    "
#    --tokenizer-type PretrainedFromHF \
#    --tokenizer-name-or-path t5-small \
#    --train-iters 500 \

OUTPUT_ARGS=" \
    --exit-interval 100 \
    --log-interval 10 \
    --save-interval $SAVE_INTERVAL \
    --eval-interval 100 \
    --eval-iters 10 \
    --checkpoint-activations \
    "

DATA_ARGS=" \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH \
    --data-path $DATA_PATH \
    --tensorboard-dir $TENSORBOARD_PATH \
    --tensorboard-queue-size 5 \
    --log-timers-to-tensorboard \
    --log-batch-size-to-tensorboard \
    --log-validation-ppl-to-tensorboard \
    "

ZERO_STAGE=1

config_json="./ds_config.json"

# DeepSpeed figures out the gradient accumulation steps (GAS) dynamically from the dynamic global batch size (GBS) via set_train_batch_size()
cat <<EOT > $config_json
{
  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
  "train_batch_size": $GLOBAL_BATCH_SIZE,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": $ZERO_STAGE
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 12
  },
  "steps_per_print": 2000,
  "wall_clock_breakdown": false
}
EOT

DEEPSPEED_ARGS=" \
    --deepspeed \
    --deepspeed_config ${config_json} \
    --zero-stage ${ZERO_STAGE} \
    --deepspeed-activation-checkpointing \
    "

ALL_ARGS="$GPT_ARGS $OUTPUT_ARGS $DATA_ARGS $DEEPSPEED_ARGS"

# if you can't stand pt-1.9 launcher noise
export LOGLEVEL=WARNING

#PYTHONPATH=~/github/00optimize/deepspeed-big-science:/hf/Megatron-DeepSpeed-master
#PYTHONPATH=/hf/Megatron-DeepSpeed-master

LAUNCHER="deepspeed --num_gpus $N_GPUS --master_port 6777"
export CMD=" \
    env USE_TF=0 \
    $LAUNCHER pretrain_gpt.py \
    --tensor-model-parallel-size $TP_SIZE \
    --pipeline-model-parallel-size $PP_SIZE \
    --distributed-backend nccl \
    $ALL_ARGS \
    "

echo $CMD

#rm -rf $CHECKPOINT_PATH
$CMD

I'm testing with pytorch-1.10 and deepspeed@master.

Thank you!

@jeffra, @tjruwase

nyngwang commented 1 year ago

Did you solve it?