triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Memory Leak in NVIDIA Triton Server (v24.09-py3) with model-control-mode=explicit #7727

Open Mustafiz48 opened 1 month ago

Mustafiz48 commented 1 month ago

This issue reports a potential memory leak observed when running NVIDIA Triton Server (v24.09-py3) with model-control-mode=explicit. The server seems to hold onto physical RAM after inference requests are completed, leading to memory exhaustion over time.

I am using Triton Inference Server to host the Segment Anything Model (SAM) by Facebook. I have exported the encoder part to ONNX format with this notebook. With the following code, I am running inference only on the sam_encoder model, which is an ONNX model hosted with Triton 24.09.

Here's the system configuration:

CPU: Intel(R) Core(TM) i7-9700
RAM: 32GB
GPU: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
OS: Ubuntu 20.04.6 LTS (focal)

I am using the following code to test the inference server:

import time
import numpy as np
import tritonclient.http as httpclient
import requests

inference_server_url = "http://localhost:8000"
triton_server = httpclient.InferenceServerClient(url="localhost:8000")

model_name = "sam_encoder"
n = 10  # Number of inference calls to run

def load_model(model_name):
    print(f"loading model: {model_name}")
    response = requests.post(f"{inference_server_url}/v2/repository/models/{model_name}/load")
    if response.status_code == 200:
        print(f"Model {model_name} loaded successfully.")
    else:
        print(f"Failed to load model {model_name}: {response.text}")

def unload_model(model_name):
    print(f"Unloading model: {model_name}")
    response = requests.post(f"{inference_server_url}/v2/repository/models/{model_name}/unload")
    if response.status_code == 200:
        print(f"Model {model_name} unloaded successfully.")
    else:
        print(f"Failed to unload model {model_name}: {response.text}")

with open(f"input_tensor_{model_name}.txt", 'r') as file:
    input_tensor = eval(file.read(), {'array':np.array, "float32": np.float32})

input_tensor_inference = httpclient.InferInput("images", input_tensor.shape, datatype="FP32")
input_tensor_inference.set_data_from_numpy(input_tensor)

for i in range(n):
    print(f"\n{i+1}/{n}: calling inference server....")
    load_model(model_name)
    embedding_result = triton_server.infer(model_name=model_name, inputs=[input_tensor_inference])
    image_embedding = embedding_result.as_numpy("embeddings")
    unload_model(model_name)

    print("Got response from the inference server")
    time.sleep(3)

Here is the docker-compose.yml file I am using to run the Triton server:

services:
  triton:
    image: nvcr.io/nvidia/tritonserver:24.09-py3
    container_name: triton-server
    runtime: nvidia
    command: tritonserver --model-repository=/models --model-control-mode=explicit
    ports:
      - "8000:8000"
      - "8001:8001"
      - "8002:8002"
    volumes:
      - ./model-repository:/models
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Here is the input file to test the code.

When I run the code, it executes properly, but the server keeps holding physical RAM. After the inference is done, it doesn't release the physical memory. The load method loads the model into the GPU and the unload method releases it from the GPU, so there is no issue with GPU memory.

But the physical RAM usage grows with more requests. Here's the graph showing RAM usage by Triton.

[graph: triton_holding_ram]

Even when it reaches a saturation point where it doesn't consume any more RAM for the same model (for example, sam_encoder.onnx), it doesn't release that RAM. So if I try to perform inference with some other model, the server starts consuming additional RAM for that model as well. Eventually it consumes all the RAM of the machine and leads to a frozen state. If we delete the Triton container, the memory is released instantly.
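For reference, here is a minimal sketch of how the container's RAM usage can be sampled from the host between load/infer/unload cycles (this is not the exact script behind the graph above; it assumes Docker is available on the host and the container is named triton-server as in the compose file):

import subprocess
import time

CONTAINER = "triton-server"  # container_name from the compose file

def container_mem_usage(container):
    # "docker stats --no-stream" prints a single sample; the Go template
    # keeps only the memory-usage column (e.g. "2.1GiB / 31.29GiB").
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", container],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Sample once per load/infer/unload cycle while the loop above is running.
for i in range(10):
    print(f"cycle {i + 1}: triton container memory = {container_mem_usage(CONTAINER)}")
    time.sleep(3)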

rmccorm4 commented 3 weeks ago

Hi @Mustafiz48, thanks for raising this issue. Do you mind trying tcmalloc and jemalloc as described in this doc to see if either alleviates the memory-holding issue you're seeing?
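As a rough sketch, the compose file above could be adapted to preload one of the allocators by setting LD_PRELOAD for the tritonserver process (the library path here is an assumption and depends on where tcmalloc or jemalloc is actually installed in the image; the other keys stay as before):

services:
  triton:
    image: nvcr.io/nvidia/tritonserver:24.09-py3
    container_name: triton-server
    runtime: nvidia
    # Preload an alternative allocator; swap in libjemalloc.so.2 to try jemalloc instead.
    environment:
      - LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
    command: tritonserver --model-repository=/models --model-control-mode=explicit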

CC @krishung5

Mustafiz48 commented 3 weeks ago

Hi @rmccorm4 , thanks for your response.

Yes, trying tcmalloc reduced the memory consumption by a lot.

This is the graph showing memory consumption before tcmalloc: [graph: before_tcmalloc]

And this is the graph after trying tcmalloc: [graph: after_tcmalloc]

Though it's still not releasing all the memory even after unloading the models, it reduced how much the consumption grows with each call.

Btw, is there any way to release all the memory after unloading the models? In the graph you can see Triton still keeps holding around 2GB of memory. By that time, all the models were already unloaded.

timstokman commented 3 weeks ago

I ran into this as well. Switching to jemalloc solved this. Is there a reason jemalloc or tcmalloc is not the default? Memory usage can be quite unpredictable for tritonserver. Good defaults might help.

rmccorm4 commented 3 weeks ago

Hi @timstokman, I believe the optimal memory allocation behavior differs between frameworks; we have found that different scenarios benefit from different allocators in different ways.

For example, some scenarios, like loading/unloading the same model repeatedly (the same or a similar chunk of memory), may have better characteristics with tcmalloc, while scenarios like loading/unloading many unique models (differently sized chunks of memory, more fragmentation) may have better characteristics with jemalloc. It is hard to pick one that will always be better, so we defer to the "standard" default malloc, which may be more widely portable across platforms, while sharing instructions for users to explore the alternatives based on their use case.

timstokman commented 3 weeks ago

So some backends perform worse with jemalloc or tcmalloc? Significantly worse? Anything we should know about?

The change to jemalloc reduced my memory usage by about a factor of 3, pretty significant.

krishung5 commented 3 weeks ago

Hi @timstokman, in our previous experiments we didn't observe worse memory usage with jemalloc or tcmalloc; they just didn't help that much for the ONNX and TF frameworks. It highly depends on the framework and workload, so we encourage users to experiment with their setup and choose the allocator that fits.