triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

GPU memory leak when loading/unloading models #5841

Open igrinis opened 1 year ago

igrinis commented 1 year ago

Description
When cycling through a load model -> infer -> unload model scenario, we observe a GPU memory leak.

This only happens when the models are in TorchScript format; there is no leak if the same models are converted to ONNX format. Everything is also fine when no inference is requested (i.e., only cycling through loading and unloading models).

Triton Information
Are you using the Triton container or did you build it yourself? Tested with NVIDIA's tritonserver:23.01-py3 and tritonserver:23.04-py3 Docker images.

To Reproduce
Start the Triton server with the --model-control-mode=explicit flag:

docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m -v $PWD:/models nvcr.io/nvidia/tritonserver:23.04-py3 bash -c "tritonserver --model-repository=/models --model-control-mode=explicit --disable-auto-complete-config"

Run a loop that loads the model, runs inference, and unloads it for a hundred iterations:

for i in $(seq 1 100); do
    curl -X POST http://127.0.0.1:8000/v2/repository/models/1/load
    curl -X POST http://127.0.0.1:8000/v2/models/1/infer -H 'Content-Type: application/json' -d @/ml_serving/v2_input.json
    curl -X POST http://127.0.0.1:8000/v2/repository/models/1/unload
    echo $i
done

Roughly every 50 cycles we lose about 1 GB of GPU memory.
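For reference, GPU memory usage can be watched from a small script while the loop above runs. This is only a minimal sketch, assuming the pynvml bindings (pip install nvidia-ml-py) and that the model lives on GPU 0:

# Sketch: log GPU memory once per second while the curl loop runs.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
baseline = pynvml.nvmlDeviceGetMemoryInfo(handle).used

while True:
    used = pynvml.nvmlDeviceGetMemoryInfo(handle).used
    print(f"used={used / 2**20:.0f} MiB  growth={(used - baseline) / 2**20:.0f} MiB")
    time.sleep(1)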

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

We use a standard RoBERTa model for a classification task.

config.pbtxt:

name: "1"
platform: "pytorch_libtorch"
default_model_filename: "model.pt"

input [
{
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [-1, -1]
},
{
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [-1, -1]
}
]

output {
    name: "logits"
    data_type: TYPE_FP32
    dims: [-1, 1]
}

To create model.pt we use the following script:

import os
import torch
from transformers import RobertaForSequenceClassification
from transformers import RobertaTokenizerFast

TS_MODEL_PATH = '/ml_serving/models/ts/1/1/model.pt'
os.makedirs(os.path.dirname(TS_MODEL_PATH), exist_ok=True)

class pyTorchToTorchScript(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # torchscript=True makes the model return plain tuples, which is required for tracing
        self.model = RobertaForSequenceClassification.from_pretrained('roberta-base', torchscript=True)

    def forward(self, *args, **kwargs):
        # Return only the logits so the traced graph has a single tensor output
        x = self.model(*args, **kwargs)
        return x[0]

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
model = pyTorchToTorchScript()

sentences = ['Hello world!', 'Another simple sentence.']
model.eval()
toks = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=200)
output = model(toks['input_ids'], toks['attention_mask'])
print('original:', output)

ts_model = torch.jit.trace(model, (toks['input_ids'], toks['attention_mask']), strict=False)
print('torch:', ts_model(toks['input_ids'], toks['attention_mask']))
ts_model.save(TS_MODEL_PATH)

v2_input.json:

{"inputs":[
    {"id":10,"name":"input_ids","shape":[10,17],"datatype":"INT64","data":[[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1],[0,100,524,3861,5187,165,1044,8,52,33,316,82,11,5,165,4,2],[0,170,32,5,4739,328,2,1,1,1,1,1,1,1,1,1,1]]},
    {"id":11,"name":"attention_mask","shape":[10,17],"datatype":"INT64","data":[[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0],[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0]]}]
}
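The same cycle can also be driven from Python with the tritonclient package instead of curl. A minimal sketch, assuming pip install tritonclient[http] and dummy tensors standing in for the data from v2_input.json:

# Sketch: load -> infer -> unload cycle via tritonclient instead of curl.
# The tensor contents are placeholders; shapes/dtypes match v2_input.json.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="127.0.0.1:8000")

input_ids = np.ones((10, 17), dtype=np.int64)
attention_mask = np.ones((10, 17), dtype=np.int64)

for i in range(100):
    client.load_model("1")

    inputs = [
        httpclient.InferInput("input_ids", [10, 17], "INT64"),
        httpclient.InferInput("attention_mask", [10, 17], "INT64"),
    ]
    inputs[0].set_data_from_numpy(input_ids)
    inputs[1].set_data_from_numpy(attention_mask)
    client.infer("1", inputs)

    client.unload_model("1")
    print(i)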

Expected behavior
The Triton server should release all the memory that was allocated for a model after it is unloaded. In the long run, memory utilization is expected to be stable.

dyastremsky commented 1 year ago

Thank you for the detailed bug report. We've filed a ticket to investigate.

igrinis commented 1 year ago

Hey guys! Any progress on the issue?

stefan-ax commented 1 year ago

Hi, I am also facing the issue and looking for a solution.

dyastremsky commented 1 year ago

Thank you for letting us know. This is still in our queue. We'll investigate soon.

thortom commented 1 year ago

I am encountering this as well.

dyastremsky commented 1 year ago

Thank you for letting us know, Thor.

As an update, we are able to reproduce this on our end as well and have been actively working on it. This was introduced in 22.12 after some changes in upstream PyTorch that month. It should have been caught by our testing then but was not; we have since fixed the related tests. We are working with the PyTorch folks to provide a fix as soon as we can.

bruce99kang commented 9 months ago

Facing the same issue. Any updates?

kadmor commented 8 months ago

Hello, is there a solution or update?

dyastremsky commented 7 months ago

Not yet. We are working on a standalone PyTorch reproducer to try to identify the source of the memory growth.
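For anyone who wants to experiment in parallel, a minimal standalone loop along those lines (assuming the traced model.pt produced by the script above and a CUDA device) could look like:

# Sketch: load, run, and free the traced model directly in PyTorch to see
# whether GPU memory is returned after each cycle. Inputs are dummy tensors.
import gc
import torch

ids = torch.ones(10, 17, dtype=torch.int64, device="cuda")
mask = torch.ones(10, 17, dtype=torch.int64, device="cuda")

for i in range(100):
    model = torch.jit.load("model.pt", map_location="cuda")
    with torch.no_grad():
        model(ids, mask)

    del model
    gc.collect()
    torch.cuda.empty_cache()
    print(i, torch.cuda.memory_reserved() // 2**20, "MiB reserved by PyTorch")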

Kokkini commented 2 months ago

Hello, I'm facing the same issue. Any updates?

dyastremsky commented 2 months ago

Not yet. We do not yet have a reproducer isolated to PyTorch or a root cause identified on the Triton side.

Ref: DLIS-4941. CC: @krishung5