Open oleks-popovych opened 1 year ago
Hi @oleks-popovych, is Triton using more RAM than your model would use outside Triton? What is the backend that you are using?
Hi @Tabrizian Thank you for replying.
Yes, Triton consumes a lot more compared to our previous solution, which executes TensorFlow code in our custom-made pipelines.
The backend is TensorFlow SavedModel.
A little more context: the packed model is around 600 MB, but VRAM usage is approximately 3-3.5 GB. The input image size is 512x350x3 uint8. Max batch size is 2, but tweaking it has no effect or a negligible effect. I'm also applying the auto mixed precision optimization.
I'm using Triton Server 22.10.
I'm quite certain there are no memory leaks, since memory doesn't grow: after warm-up it stays the same. Still, deploying more models becomes problematic.
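For scale, the input tensors themselves are tiny compared to the observed usage, so the request data cannot explain the footprint. A quick back-of-the-envelope using the shapes quoted above:

```python
# Back-of-the-envelope: size of one input batch, using the shapes
# quoted above (512 x 350 x 3 uint8, max batch size 2).
bytes_per_image = 512 * 350 * 3    # uint8 = 1 byte per element
batch_bytes = 2 * bytes_per_image  # max batch size is 2
print(batch_bytes)                 # 1075200 bytes, ~1 MiB per batch
```

About 1 MiB per batch, which suggests the multi-GB footprint comes from the framework's allocator pools and graph state rather than from the inputs.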
@tanmayv25 Any ideas what could be the issue here?
As per my understanding, Triton by itself would not consume this much memory. @oleks-popovych Do you see such high memory consumption just after loading the model, or does the memory consumption increase when you send inference requests? Can you share your model (or any dummy model) that demonstrates the different memory consumption within/outside Triton?
I have a similar problem. I have a TensorFlow model (250 MB saved on disk).
My observations on (CPU) RAM usage:
- no GPU
- with 1 GPU
- with 2 GPUs

GPU RAM usage is 1.6 GB.
I don't understand why so much RAM is even needed; the inputs are single normal-size images. We have had a similar experience with tensorflow-serving. I can try a dummy model later. Is this expected behaviour? Is it avoidable?
@thran TensorFlow sessions allocate memory pools for caching purposes; I suspect that is what is causing the memory consumption to increase. I wonder if there are any hooks that TensorFlow exposes for us to restrict or tweak CPU memory allocations. For limiting GPU memory, you can use `gpu-memory-fraction`.
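As a sketch, the flag would be passed at server startup via the backend-config convention mentioned elsewhere in this thread (the `0.3` value and `/models` path are placeholders):

```shell
# Cap the TensorFlow per-process GPU memory pool at 30%.
tritonserver --model-repository=/models \
  --backend-config=tensorflow,gpu-memory-fraction=0.3
```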
@tanmayv25 It's funny, but I'm actually okay with the GPU memory consumption.
Same here: I have 13 GB of unused memory on the GPU, but CPU RAM is full.
@oleks-popovych

> Yes, Triton consumes a lot more compared to our previous solution, which executes TensorFlow code in our custom-made pipelines.
Can you describe how your custom-made pipelines load the TF model and serve it? Triton's TF backend loads and serves models in ways similar to tensorflow-serving. If your custom solution was using less RAM, then there is room for improvement.
A couple of questions to understand the difference:
So far I have not been able to find any TF hooks to reduce the memory footprint, but comparing your custom solution with Triton's might give us a hint.
@tanmayv25 The following table describes my experiments based on your suggestions:

| Experiment | RAM consumption |
|---|---|
| nothing launched | 412 MB |
| no auto mixed precision | 10.9 GB |
| model with auto mixed precision | 11.8 GB |
| auto mixed + both `TF_NUM_INTRA/INTER_THREADS` = 1 | 11.9 GB |
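For reference, the thread counts in the last row can be pinned per model in `config.pbtxt`; a sketch, assuming the parameter names documented for Triton's TensorFlow backend (the value `1` is illustrative):

```
parameters: {
  key: "TF_NUM_INTRA_THREADS"
  value: { string_value: "1" }
}
parameters: {
  key: "TF_NUM_INTER_THREADS"
  value: { string_value: "1" }
}
```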
My custom solution looks as follows:
I had a model trained on TF1. I've switched to TF2, but use it via the TF1 compatibility layer embedded in TF2. The model is loaded using `tf.compat.v1.import_graph_def`, and then a `tf.compat.v1.Session` is created like follows:

```python
import tensorflow as tf

# `graph` holds the result of tf.compat.v1.import_graph_def (see above)
device_count = {"GPU": 1}
per_process_gpu_memory_fraction = 0.3
sess_config = tf.compat.v1.ConfigProto(
    allow_soft_placement=True,
    gpu_options=tf.compat.v1.GPUOptions(
        per_process_gpu_memory_fraction=per_process_gpu_memory_fraction
    ),
    device_count=device_count,
)
session = tf.compat.v1.Session(graph=graph, config=sess_config)
```
It is later used like this:

```python
feed = {self.graph_input_tensor: input_batch}
output = self.list_output_tensors
results = self.session.run(output, feed_dict=feed)
```
With regards to `TF_GRAPH_TAG` and `TF_SIGNATURE_DEF`: I'm not using those signatures in my custom solution, but when preparing the model for Triton I use the default signatures from this tutorial.
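As a sketch of how those defaults would appear in `config.pbtxt`, assuming the parameter names from Triton's TensorFlow backend (`serve` and `serving_default` are TensorFlow's default SavedModel tag and signature names):

```
parameters: {
  key: "TF_GRAPH_TAG"
  value: { string_value: "serve" }
}
parameters: {
  key: "TF_SIGNATURE_DEF"
  value: { string_value: "serving_default" }
}
```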
To summarize:
- `--backend-config=tensorflow,allow-soft-placement`
- `--backend-config=tensorflow,gpu-memory-fraction`
@oleks-popovych Thanks for sharing these results. Can you also include the RAM consumption of your custom solution in the table? One more question: I just want to confirm that you are comparing identical TF versions between Triton and your custom solution.
Same problem here. https://github.com/triton-inference-server/server/issues/5392
- In Triton server: 54.2%, 16.9 GB
- In TensorFlow Serving: 25.4%, 8.1 GB, about half of Triton server
triton server version: 22.12
tensorflow serving version:

```
root@aeb36fd52821:/# tensorflow_model_server --version
TensorFlow ModelServer: 2.4.1-rc4
TensorFlow Library: 2.4.1
```
tensorflow serving cmd to start docker:

```shell
docker run \
  -d \
  --gpus all \
  -e CUDA_VISIBLE_DEVICES=$GPU_INDEX \
  -e TF_FORCE_GPU_ALLOW_GROWTH='true' \
  -p $PORT:8501 \
  -t <harbor>:<tag> \
  --model_config_file=/models/gpu_model_config.config
```
Is your feature request related to a problem? Please describe. I'm trying to use tritonserver in my project, but it uses a lot of RAM for a single model.
Describe the solution you'd like I'd like to collect a list of tips and tricks that actually help reduce the memory footprint.
Describe alternatives you've considered I've rigorously checked the documentation and GitHub issues.
Additional context My setup looks as follows: an EC2 g4dn.xlarge (4 vCPU, 16 GB RAM, NVIDIA T4). While deployed, the model consumes about 10 GB of RAM. Should I consider using instances with more RAM?