triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Triton Server consumes too much memory #5392

Open Arashi19901001 opened 1 year ago

Arashi19901001 commented 1 year ago

Description

Two commands were used:

Run with GPU:

docker run \
    -d \
    --name <name1> \
    --gpus device=0 \
    --entrypoint /opt/tritonserver/bin/tritonserver \
    -p $PORT:8000 \
    -t <harbor>:<image1> \
    --model-repository=/models

Run with CPU:

docker run \
    -d \
    --name <name2> \
    --entrypoint /opt/tritonserver/bin/tritonserver \
    -p $CPU_PORT:8000 \
    -t <harbor>:<image2> \
    --model-repository=/models

To Reproduce

Expected behavior

krishung5 commented 1 year ago

Hi @Arashi19901001,

  1. For the memory leak assumption, could you run with a tool like Valgrind and see if any leaks are reported?
  2. We have been observing some memory growth when loading/unloading TF models due to the heuristics of the TF memory allocator. I suspect the growing memory consumption might have something to do with that as well. Could you try switching to tcmalloc (see the sketch below) and see if the memory consumption improves?
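
For reference, switching the allocator is typically just a matter of preloading tcmalloc when launching tritonserver inside the container; a minimal sketch, assuming the usual Ubuntu library path and the /models repository from the commands above (the exact .so path can differ by image):

# preload tcmalloc so it replaces the default malloc for the server process
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4:${LD_PRELOAD} \
    tritonserver --model-repository=/models

# for the leak check, a Valgrind run would look roughly like this
valgrind --leak-check=full --log-file=/tmp/valgrind.log \
    tritonserver --model-repository=/models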
Arashi19901001 commented 1 year ago

@krishung5

whateverforever commented 1 year ago

Also interested in this. We use the TensorFlow backend with the option to limit GPU memory usage to a fraction of 0.25. However, tritonserver uses about 0.35, much more than should be required.
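
If it helps anyone reproduce this, the 0.25 limit mentioned above is presumably the TensorFlow backend's per-process GPU memory fraction, which (as far as I know) is passed as a backend config when starting the server; a sketch with an assumed model repository path:

# limit the TF backend to roughly 25% of GPU memory
tritonserver \
    --model-repository=/models \
    --backend-config=tensorflow,gpu-memory-fraction=0.25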

michael-ryan-warner commented 1 year ago

Also interested -- right now we have exclusively Python backend models. Based on metrics from Prometheus, I generally see GPU memory usage increase over time until it saturates the GPU. It doesn't necessarily lead to errors, but it does seem like something isn't being released.
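
For anyone who wants to watch the same numbers without a full Prometheus setup, Triton serves its metrics as plain text on the metrics port (8002 by default), so the GPU memory gauges can be checked directly; a rough sketch, assuming the port is reachable from the host:

# nv_gpu_memory_used_bytes / nv_gpu_memory_total_bytes are the per-GPU memory gauges
curl -s localhost:8002/metrics | grep nv_gpu_memory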

damonmaria commented 9 months ago

We have a similar issue where Triton uses a huge amount of CPU memory even though our models use the ONNX+TensorRT backend, so I would expect very little CPU memory usage.

For example, one of our clusters has 3 models whose onnx files total 674MB, yet Triton uses 16GB of CPU RAM. This is how much it uses just to start up, and the usage does not change while serving.

We also noticed a large jump in CPU RAM usage from Triton 23.08 onwards. For example, the cluster above was only using 12GB on version 23.07.

We have another cluster with 2.2GB of onnx model files. Here is the memory usage of Triton (23.07) as those models are loaded:

[image: CPU memory usage of Triton 23.07 while the models load]

It ends up using about 23GB of RAM. If I try to load the same models in Triton versions 23.08 - 23.11, it hits the 40GB memory limit and is OOM-killed. Again, these are all ONNX with TensorRT.

I have tried loading Triton with the same ONNX models but with the TensorRT optimization removed (so they should fall back to the CUDA execution provider), and the CPU memory usage drops to only 3GB. So it definitely seems that the TensorRT optimization of ONNX is the problem in my situation.
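
For context, my understanding is that the "TensorRT optimization" being removed here is the gpu_execution_accelerator block in each model's config.pbtxt, which tells the ONNX Runtime backend to build a TensorRT engine inside the server at load time; a sketch of what such a block looks like (the model path is a placeholder):

# hypothetical model path; deleting this block from the config is what
# switches the model back to the plain CUDA execution provider
cat >> /models/my_model/config.pbtxt <<'EOF'
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ { name : "tensorrt" } ]
  }
}
EOF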

damonmaria commented 9 months ago

Further to the above, I tried manually converting the ONNX models to TensorRT engines and loading those instead; the memory usage is then only 3GB. So it is almost certainly the ONNX-to-TensorRT conversion inside the ONNX backend that is causing the high memory usage.
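
In case it is useful, the manual conversion described here can also be done ahead of time with trtexec and the resulting plan served by the tensorrt_plan backend, instead of letting the ONNX backend convert at load time; a rough sketch with assumed file names:

# build the TensorRT engine offline
trtexec --onnx=model.onnx --saveEngine=model.plan
# then place model.plan in the model repository as a tensorrt_plan model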