Mustafiz48 opened this issue 1 month ago
Hi @Mustafiz48, thanks for raising this issue. Do you mind trying tcmalloc and jemalloc as described in this doc and seeing if either alleviates the memory-holding issue you're seeing?
CC @krishung5
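For reference, the approach in that doc comes down to preloading the alternate allocator into the tritonserver process (typically via LD_PRELOAD). As a rough, non-authoritative sketch, a tiny Python launcher doing the same thing could look like the following; the .so path and the --model-repository path are assumptions for the Ubuntu-based Triton images, and in a docker-compose setup you would normally set LD_PRELOAD in the service's environment or entrypoint instead:

```python
import os

# Preload tcmalloc before exec'ing tritonserver (swap in libjemalloc.so to try jemalloc).
# The library path is an assumption based on the Ubuntu-based Triton images.
TCMALLOC = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"

existing = os.environ.get("LD_PRELOAD", "")
os.environ["LD_PRELOAD"] = f"{TCMALLOC}:{existing}" if existing else TCMALLOC

# Replace this process with tritonserver so the preloaded allocator applies to it.
os.execvp(
    "tritonserver",
    [
        "tritonserver",
        "--model-repository=/models",      # repository path is an assumption
        "--model-control-mode=explicit",   # matches the setup in this issue
    ],
)
```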
Hi @rmccorm4, thanks for your response.
Yes, after trying tcmalloc, memory consumption dropped by a lot.
This is the graph showing memory consumption before:
And this is the graph after switching to tcmalloc:
Though it still doesn't release all the memory even after unloading the models, tcmalloc does reduce how much consumption grows after each call.
By the way, is there any way to release all the memory after unloading the models? In the graph you can see Triton still holding around 2 GB of memory, even though all the models had already been unloaded by that point.
I ran into this as well. Switching to jemalloc solved this. Is there a reason jemalloc or tcmalloc is not the default? Memory usage can be quite unpredictable for tritonserver. Good defaults might help.
Hi @timstokman, I believe the optimal memory allocation behavior differs between frameworks, and we have found that different scenarios benefit from different allocators in different ways.
For example, scenarios like loading/unloading the same model repeatedly (the same or similarly sized chunk of memory) may have better characteristics with tcmalloc, while scenarios like loading/unloading many unique models (differently sized chunks of memory, more fragmentation) may have better characteristics with jemalloc. It is hard to pick one that will always be better, so we defer to the standard default malloc, which may be more widely portable across platforms, while sharing some instructions for users to explore the alternatives based on their use case.
So some backends perform worse with jemalloc or tcmalloc? Significantly worse? Anything we should know about?
The change to jemalloc reduced my memory usage by about a factor of 3, pretty significant.
Hi @timstokman, in our previous experiments we didn't observe worse memory usage with jemalloc or tcmalloc; they just didn't help that much in our comparisons with the ONNX and TF frameworks. It depends heavily on the framework and workload, so we encourage users to experiment with their setup and choose the allocator that fits.
This issue reports a potential memory leak observed when running NVIDIA Triton Server (v24.09-py3) with model-control-mode=explicit. The server seems to hold onto physical RAM after inference requests are completed, leading to memory exhaustion over time.
I am using Triton Inference Server to host the Segment Anything model by Facebook. I have exported the encoder part to ONNX format with this notebook. With the following code, I am running inference only on the sam_encoder model, which is an ONNX model hosted with Triton 24.09.
Here's the system configuration:
I am using the following code to test the inference server:
Here is the docker-compose.yml file I am using to run the Triton server:
Here is the input file to test the code.
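(The test script, compose file, and input file referenced above are attached to the issue rather than reproduced here. Purely for orientation, a minimal sketch of this kind of explicit-mode test client is shown below; the model name sam_encoder comes from the issue, while the URL, tensor names, and input shape are assumptions rather than the actual ONNX export's signature.)

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's HTTP endpoint (host/port mapping is an assumption).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Explicit model control mode: load the model before running inference.
client.load_model("sam_encoder")

# Dummy tensor standing in for a preprocessed image; the real script reads the
# attached input file. Tensor names and shape here are assumptions.
image = np.random.rand(1, 3, 1024, 1024).astype(np.float32)

inputs = [httpclient.InferInput("images", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [httpclient.InferRequestedOutput("embeddings")]

result = client.infer(model_name="sam_encoder", inputs=inputs, outputs=outputs)
print(result.as_numpy("embeddings").shape)

# Unload afterwards; this frees the GPU copy, while host RSS is what keeps growing.
client.unload_model("sam_encoder")
```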
When I run the code, it executes properly, but it keeps holding physical RAM even after the inference is done. With the load method it loads the model into the GPU, and with unload it releases the model from the GPU, so on the GPU side we don't have any issue.
But the physical RAM keeps growing with more requests. Here's the graph showing RAM usage by Triton.
Even when it reaches a saturation point where it doesn't consume any more RAM for the same model (for example sam_encoder.onnx), it doesn't release that RAM. So if I try to perform inference with some other new model, it starts consuming additional RAM for that model as well. Eventually it consumes all the RAM of the machine and leads to a frozen state. If we delete the Triton container, the memory is instantly released.