Describe the bug
When trying to use register_full_backward_hook in Megatron-Deepspeed, I get a huge memory leak.
I'm reporting it here, since when I turn off deepspeed, there is no leak.
To Reproduce
I tried to create a small independent example that uses deepspeed directly but I couldn't make it leak.
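For reference, my standalone attempt looked roughly like the sketch below - the toy model, config values and loop are illustrative, not the exact script I ran - and this kind of plain setup did not leak for me:

```python
# Illustrative standalone repro attempt (toy model and config are made up);
# launched with the deepspeed launcher, e.g.: deepspeed standalone_repro.py
import torch
import deepspeed

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)

def backward_hook(module, grad_input, grad_output):
    pass  # no-op, same as in the Megatron patch below

# register the hook on every submodule
model.apply(lambda m: m.register_full_backward_hook(backward_hook))

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for step in range(50):
    x = torch.randn(8, 1024, device=engine.device)
    loss = engine(x).pow(2).mean()
    engine.backward(loss)
    engine.step()
    # with this plain setup the number below stays flat for me
    print(f"step {step}: allocated {torch.cuda.memory_allocated() / 2**20:.1f} MB")
```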
So, let's work with Megatron-Deepspeed. We can use either the bigscience version or your original one - it leaks in both versions (since the problem is triggered by deepspeed).
git clone https://github.com/microsoft/Megatron-DeepSpeed
cd Megatron-DeepSpeed
now apply this patch:
diff --git a/megatron/mpu/cross_entropy.py b/megatron/mpu/cross_entropy.py
index 8c790cd..a0b40b1 100644
--- a/megatron/mpu/cross_entropy.py
+++ b/megatron/mpu/cross_entropy.py
@@ -107,4 +107,4 @@ class _VocabParallelCrossEntropy(torch.autograd.Function):
 def vocab_parallel_cross_entropy(vocab_parallel_logits, target):
     """Helper function for the cross entropy."""
-    return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target)
+    return _VocabParallelCrossEntropy.apply(vocab_parallel_logits.clone(), target)
diff --git a/megatron/training.py b/megatron/training.py
index e3a168c..9389029 100644
--- a/megatron/training.py
+++ b/megatron/training.py
@@ -692,6 +692,13 @@ def train(forward_step_func, model, optimizer, lr_scheduler,
     # Write args to tensorboard
     write_args_to_tensorboard()
 
+    def backward_hook(module, input, output): pass
+    def _register_backward_hook(module):
+        module.register_full_backward_hook(backward_hook)
+        #module.register_backward_hook(backward_hook)
+    model[0].apply(_register_backward_hook)
+
+
     # Turn on training mode which enables dropout.
     for model_module in model:
         model_module.train()
The cross_entropy change relates to an issue in megatron-lm that is unrelated to this report, but it is required to be able to use backward hooks at all.
As you can see, I'm adding a no-op backward hook - a very trivial change.
If I use the new register_full_backward_hook I get a huge leak when running train(). If I use the deprecated register_backward_hook all is good.
If I turn off deepspeed the leak goes away as well.
I experimented with removing various configs and with disabling Z1 - it didn't make a difference, so it's somewhere in the engine.
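For completeness, this is roughly how I watch for the leak - a small helper of my own (not something that ships with Megatron or deepspeed) that prints the CUDA memory counters once per iteration from inside the training loop:

```python
import torch

def log_cuda_memory(iteration):
    """Print per-iteration CUDA memory stats.

    With register_full_backward_hook enabled the allocated counter keeps
    growing iteration after iteration; with the deprecated
    register_backward_hook (or with deepspeed disabled) it stays flat.
    """
    mb = 2 ** 20
    print(f"iter {iteration:4d} | "
          f"allocated {torch.cuda.memory_allocated() / mb:9.1f} MB | "
          f"max allocated {torch.cuda.max_memory_allocated() / mb:9.1f} MB | "
          f"reserved {torch.cuda.memory_reserved() / mb:9.1f} MB")
```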
I started researching the cause of the leak in general and found this discussion: https://discuss.pytorch.org/t/register-full-backward-hook-causes-memory-leak/122904 which suggests that somewhere backward creates a graph which creates a self-reference loop, so the tensors never get released.
Using the above patch you should be able to reproduce the leak within 10 iterations on a tiny model. I'm not sure how you test Megatron-Deepspeed. You can for example use our test suite from https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tests/test_training.py,
or you can use this, but you will need to create a bit of data and grab the vocab files from https://github.com/NVIDIA/Megatron-LM#downloading-checkpoints
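One way to test the self-reference loop theory from above: cycles are only reclaimed by Python's cyclic garbage collector, not by reference counting, so if forcing gc.collect() every iteration releases the memory, the leaked tensors are indeed held in a cycle. A rough sketch of such a check (again a helper of my own, called once per training iteration):

```python
import gc
import torch

def check_for_cycle(iteration):
    before = torch.cuda.memory_allocated()
    # collect() reclaims objects that are only reachable through reference
    # cycles; if this frees a lot of CUDA memory, the leak is such a cycle.
    unreachable = gc.collect()
    after = torch.cuda.memory_allocated()
    print(f"iter {iteration}: gc found {unreachable} unreachable objects, "
          f"allocated {before / 2**20:.1f} MB -> {after / 2**20:.1f} MB")
```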
I'm testing with pytorch-1.10, and deepspeed@master.
Thank you!
@jeffra, @tjruwase