Arashi19901001 opened this issue 1 year ago
Hi @Arashi19901001,
- For the memory-leak assumption, could you run with a tool like Valgrind and see whether any leaks are reported?
- We have been observing some memory growth when loading/unloading TF models due to heuristics in the TF memory allocator. I suspect that the growing memory consumption might have something to do with that as well. Could you try switching to tcmalloc and see if the memory consumption improves?
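To make the two suggestions above concrete, here is a sketch of how one might launch the server for each check. The model-repository path is a placeholder, and the tcmalloc library path varies by distro/container image:

```shell
# Leak check: run tritonserver under Valgrind (very slow; use a small model repo)
valgrind --leak-check=full --log-file=valgrind.log \
  tritonserver --model-repository=/models

# Allocator check: preload tcmalloc before starting tritonserver
# (library path shown is typical for Debian/Ubuntu images; adjust as needed)
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
  tritonserver --model-repository=/models
```

If memory stays flat under tcmalloc, that points at allocator heuristics rather than a true leak.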
@krishung5
I didn't load or unload any models after I started the container.
Now my question is: why does my Triton service cost so much memory? I've run some tests on another machine:
- Delete the warmup part in config.pbtxt
- With warmup in config.pbtxt
The results of the two tests look the same. After the service is warmed up, it takes about 16 GB for my GPU service and 7 GB for my CPU service. Is there any way to reduce memory usage?
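For reference, a warmup stanza like the one being toggled above might look like this in `config.pbtxt`. The field names follow Triton's model-config proto (`model_warmup`), but the input name, dtype, and shape here are hypothetical:

```proto
model_warmup [
  {
    name: "sample_warmup"
    batch_size: 1
    inputs {
      key: "input_tensor"        # hypothetical input name
      value: {
        data_type: TYPE_FP32
        dims: [ 224, 224, 3 ]    # hypothetical shape
        zero_data: true          # send all-zero data for warmup
      }
    }
  }
]
```

Since both runs converge to the same steady-state footprint, the warmup stanza itself is unlikely to be the source of the memory cost; it only front-loads allocations that would otherwise happen on the first real request.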
Before migrating to Triton Server, I had tried TensorFlow Serving, and its memory usage was about half that of Triton Server with the TensorFlow backend. Did I misconfigure something?
Also interested in this. We use the TensorFlow backend with the option to limit GPU memory usage to 0.25. However tritonserver uses about 0.35, much more than should be required.
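For context, the TensorFlow backend's GPU memory cap is typically passed at server launch via a backend config flag; a sketch (the model-repository path is a placeholder):

```shell
# Cap the TF backend's per-GPU memory allocation at 25% of device memory
tritonserver \
  --model-repository=/models \
  --backend-config=tensorflow,gpu-memory-fraction=0.25
```

Note this fraction governs the TF allocator's pool; Triton's own runtime, CUDA context, and other backends allocate outside it, which may explain observed usage above the configured fraction.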
Also interested -- right now we have exclusively python backend models. I generally see gpu memory usage increase over time to saturate the GPU, based on metrics from Prometheus. It doesn't necessarily lead to errors, but it does seem like something isn't being released.
We have a similar issue where Triton uses a huge amount of CPU memory even though our models use the ONNX+TensorRT backend, so I would expect very little CPU memory usage at all.
For example, one of our clusters has 3 models where the onnx files total 674MB, yet Triton uses 16GB of CPU RAM. This is how much it uses just to start and the memory usage does not change during use.
We also noticed a large jump in CPU RAM usage from Triton 23.08 onwards. For example, the cluster above was only using 12GB on version 23.07.
We have another cluster with 2.2GB of onnx model files. Here is the memory usage of Triton (23.07) as those models are loaded:
It ends up using about 23GB of RAM. If I try to load the same models in Triton versions 23.08 - 23.11 then it hits the 40GB memory limit and is OOM killed. Again, these are all ONNX with TensorRT.
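When comparing versions like this, it helps to log RSS as the models load rather than only looking at the peak. A minimal stdlib-only sketch (Linux-only, reads `/proc`; finding the tritonserver PID is left to the reader):

```python
import os

def rss_mib(pid=None):
    """Return the resident set size of a process in MiB (Linux /proc only)."""
    pid = os.getpid() if pid is None else pid
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                # Line looks like "VmRSS:   123456 kB"; value is in kB
                return int(line.split()[1]) / 1024
    raise RuntimeError("VmRSS not found")

# Example: poll your own process; point `pid` at tritonserver to watch it load.
print(f"current RSS: {rss_mib():.1f} MiB")
```

Sampling this in a loop while models load gives a per-model attribution of the growth, which makes version-to-version regressions much easier to report.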
I have tried loading Triton with the same ONNX models but with the TensorRT optimization removed (so they should be using CUDA) and the CPU memory usage drops to only 3GB. So it definitely seems that the TensorRT optimization of ONNX is the problem in my situation.
Further to the above, I have tried manually converting the ONNX models to TensorRT and loading those. Then the memory usage is only 3GB. So it is almost certainly the conversion process from ONNX to TensorRT in the ONNX backend that is causing the memory usage.
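Given that observation, one workaround is to do the ONNX-to-TensorRT conversion offline with `trtexec` and serve the resulting engine through the tensorrt backend, so tritonserver never runs the in-process conversion. A sketch (file names are placeholders):

```shell
# Build a serialized TensorRT engine from the ONNX model offline
trtexec --onnx=model.onnx --saveEngine=model.plan

# Then place model.plan in the model repository as a tensorrt-backend model,
# e.g. models/<model_name>/1/model.plan with a matching config.pbtxt
```

Engines built this way are specific to the GPU architecture and TensorRT version they were built with, so they need rebuilding when either changes.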
**Description**

Two commands:

run with gpu
- max_batch_size is set to 64
- nvidia-smi: it takes 4589MiB, it seems ok
- htop: it takes 56.9% of 32GB, about 18GB

run with cpu
- 14 models in total, about 460MB
- served with tensorflow backend
- all max_batch_size is set to 64
- htop: it takes 30.2% of 32GB, about 9.7GB

Two questions

**Reproduce**

**Expected behavior**