triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Triton Server consumes too much memory #5392

Open Arashi19901001 opened 1 year ago

Arashi19901001 commented 1 year ago

Description

Two commands were used:

Run with GPU:

docker run \
    -d \
    --name <name1> \
    --gpus device=0 \
    --entrypoint /opt/tritonserver/bin/tritonserver \
    -p $PORT:8000 \
    -t <harbor>:<image1> \
    --model-repository=/models

Run with CPU:

docker run \
    -d \
    --name <name2> \
    --entrypoint /opt/tritonserver/bin/tritonserver \
    -p $CPU_PORT:8000 \
    -t <harbor>:<image2> \
    --model-repository=/models

To Reproduce

Expected behavior

krishung5 commented 1 year ago

Hi @Arashi19901001,

  1. For the memory leak assumption, could you run with a tool like Valgrind and see if any leaks are reported?
  2. We have been observing some memory growth when loading/unloading TF models due to the heuristics of the TF memory allocator. I suspect the growing memory consumption might have something to do with that as well. Could you try switching to tcmalloc (see the sketch below) and see if the memory consumption improves?
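
For reference, switching the allocator is typically just a matter of preloading tcmalloc when launching tritonserver inside the container; a minimal sketch, assuming the usual Ubuntu library path and the /models repository from the commands above (the exact .so path can differ by image):

# preload tcmalloc so it replaces the default malloc for the server process
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4:${LD_PRELOAD} \
    tritonserver --model-repository=/models

# for the leak check, a Valgrind run would look roughly like this
valgrind --leak-check=full --log-file=/tmp/valgrind.log \
    tritonserver --model-repository=/models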
Arashi19901001 commented 1 year ago

@krishung5

whateverforever commented 1 year ago

Also interested in this. We use the TensorFlow backend with the option to limit GPU memory usage to a fraction of 0.25. However, tritonserver uses about 0.35, much more than should be required.
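
If it helps anyone reproduce this, the 0.25 limit mentioned above is presumably the TensorFlow backend's per-process GPU memory fraction, which (as far as I know) is passed as a backend config when starting the server; a sketch with an assumed model repository path:

# limit the TF backend to roughly 25% of GPU memory
tritonserver \
    --model-repository=/models \
    --backend-config=tensorflow,gpu-memory-fraction=0.25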

michael-ryan-warner commented 1 year ago

Also interested -- right now we have exclusively Python backend models. Based on metrics from Prometheus, I generally see GPU memory usage increase over time until it saturates the GPU. It doesn't necessarily lead to errors, but it does seem like something isn't being released.
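
For anyone who wants to watch the same numbers without a full Prometheus setup, Triton serves its metrics as plain text on the metrics port (8002 by default), so the GPU memory gauges can be checked directly; a rough sketch, assuming the port is reachable from the host:

# nv_gpu_memory_used_bytes / nv_gpu_memory_total_bytes are the per-GPU memory gauges
curl -s localhost:8002/metrics | grep nv_gpu_memory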

damonmaria commented 9 months ago

We have a similar issue where Triton uses a huge amount of CPU memory even though our models use the ONNX+TensorRT backend, so I would expect very little CPU memory usage.

For example, one of our clusters has 3 models whose onnx files total 674MB, yet Triton uses 16GB of CPU RAM. This is how much it uses just to start up, and the usage does not change while serving.

We also noticed a large jump in CPU RAM usage from Triton 23.08 onwards. For example, the cluster above was only using 12GB on version 23.07.

We have another cluster with 2.2GB of onnx model files. Here is the memory usage of Triton (23.07) as those models are loaded:

[image: CPU memory usage of Triton 23.07 while the models load]

It ends up using about 23GB of RAM. If I try to load the same models in Triton versions 23.08 - 23.11, it hits the 40GB memory limit and is OOM-killed. Again, these are all ONNX with TensorRT.

I have tried loading Triton with the same ONNX models but with the TensorRT optimization removed (so they should fall back to the CUDA execution provider), and the CPU memory usage drops to only 3GB. So it definitely seems that the TensorRT optimization of ONNX is the problem in my situation.
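
For context, my understanding is that the "TensorRT optimization" being removed here is the gpu_execution_accelerator block in each model's config.pbtxt, which tells the ONNX Runtime backend to build a TensorRT engine inside the server at load time; a sketch of what such a block looks like (the model path is a placeholder):

# hypothetical model path; deleting this block from the config is what
# switches the model back to the plain CUDA execution provider
cat >> /models/my_model/config.pbtxt <<'EOF'
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ { name : "tensorrt" } ]
  }
}
EOF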

damonmaria commented 9 months ago

Further to the above, I tried manually converting the ONNX models to TensorRT engines and loading those instead; the memory usage is then only 3GB. So it is almost certainly the ONNX-to-TensorRT conversion inside the ONNX backend that is causing the high memory usage.
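
In case it is useful, the manual conversion described here can also be done ahead of time with trtexec and the resulting plan served by the tensorrt_plan backend, instead of letting the ONNX backend convert at load time; a rough sketch with assumed file names:

# build the TensorRT engine offline
trtexec --onnx=model.onnx --saveEngine=model.plan
# then place model.plan in the model repository as a tensorrt_plan model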