[BUG] GPU memory leaking after deleting deepspeed engine

kfertakis commented 1 week ago

Describe the bug

Initialising a trainable model with DeepSpeed and then deleting the engine leaves GPU memory allocated.

To Reproduce

Running the following simple test script shows that GPU memory remains allocated even after all references to the DeepSpeed engine are deleted.

import torch
import torch.nn as nn
import torch.nn.functional as F
import deepspeed
import gc

class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(1024*1024, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = SimpleNet()

ds_config = {
    "train_batch_size": 8,
    "steps_per_print": 10,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    },
    "fp16": {
        "enabled": True
    }
}

print(f"initial Allocated Memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB")

model_engine, optimizer, t, l = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
    dist_init_required=True
)
print(f"before model moved to GPU Allocated Memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB") #700.01 MB

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_engine.to(device)

print(f"after model moved Allocated Memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB") #700.01 MB
# Drop every reference to the engine, the optimizer, and the wrapped model.
del model_engine, optimizer, t, l, model
# Collect any reference cycles, then release cached blocks held by the CUDA allocator.
gc.collect()
torch.cuda.empty_cache()
print(f"after refs deleted Allocated Memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB") #100.00 MB
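
The figures in the script's comments can be checked against simple parameter arithmetic. The sketch below assumes the usual mixed-precision Adam layout (an fp16 working copy, an fp32 master copy, and two fp32 Adam moments per parameter), ignores allocator rounding, and assumes no gradients exist yet because no backward pass has run. This reading is an interpretation and is not confirmed anywhere in the thread.

```python
# Parameter count of SimpleNet: weights plus biases of fc1 and fc2.
fc1_params = 1024 * 1024 * 50 + 50
fc2_params = 50 * 10 + 10
n_params = fc1_params + fc2_params

MIB = 1024 ** 2  # the script divides by 1024 ** 2 and labels the result "MB"

# Assumed bytes per parameter for each buffer (not confirmed by the thread).
FP16_COPY = 2          # half-precision working weights
FP32_MASTER = 4        # full-precision master weights
ADAM_MOMENTS = 2 * 4   # first and second moment, both fp32

after_init = n_params * (FP16_COPY + FP32_MASTER + ADAM_MOMENTS) / MIB
fp16_only = n_params * FP16_COPY / MIB

print(f"{n_params:,} parameters")
print(f"fp16 + fp32 master + Adam moments: {after_init:.2f} MB")
print(f"fp16 copy alone: {fp16_only:.2f} MB")
```

Under those assumptions the two totals come out at 700.01 MB and 100.00 MB, the same values noted in the script's comments, which would suggest that the fp16 parameter copy is what stays resident after the references are dropped.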

Expected behavior

GPU memory should be freed when a DeepSpeed engine gets deleted. Is there another API available for releasing the allocated GPU memory without having to kill the process? Thanks.

System info:

OS: Ubuntu 20.04.6 LTS
GPU count and types: 1 machine with 4x NVIDIA RTX A6000
Python version: 3.9

Launcher context

deepspeed --num_gpus=1 --master_port 12346 test_deepspeed.py

adk9 commented 3 days ago

> Is there another API available for releasing the allocated GPU memory without having to kill the process?

Hi @kfertakis, yes, there is an explicit API for freeing engine resources: call model_engine.destroy() to reclaim the allocated memory.
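
A minimal sketch of that cleanup sequence follows. `release_engine` is a hypothetical helper name, not part of DeepSpeed; the only DeepSpeed call it relies on is the `destroy()` method mentioned above, and the `gc.collect()` and `torch.cuda.empty_cache()` steps are carried over from the original test script.

```python
import gc


def release_engine(engine):
    """Tear down an engine-like object and release cached GPU memory.

    Hypothetical helper: returns True if the object exposed destroy().
    """
    destroy = getattr(engine, "destroy", None)
    destroyed = callable(destroy)
    if destroyed:
        destroy()  # explicit teardown suggested in the thread

    gc.collect()  # collect reference cycles that may still hold tensors

    try:
        import torch
    except ImportError:  # keeps the sketch importable without PyTorch
        return destroyed
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release unused cached blocks
    return destroyed
```

In the test script this would be called as `release_engine(model_engine)` before the references are dropped. The caller still has to delete its own names (`model_engine`, `optimizer`, `model`), since a helper cannot unbind variables in the calling scope.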

kfertakis commented 3 days ago

Thank you for the reference. This does the job.

microsoft / DeepSpeed

[BUG] GPU memory leaking after deleting deepspeed engine #5674