triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Suggestion to reduce RAM consumption #5259

Open oleks-popovych opened 1 year ago

oleks-popovych commented 1 year ago

Is your feature request related to a problem? Please describe.
I'm trying to use tritonserver in my project, but it uses a lot of RAM for a single model.

Describe the solution you'd like
I'd like to collect a list of tips and tricks that actually help to reduce the memory footprint.

Describe alternatives you've considered
I've rigorously checked the documentation and GitHub issues.

Additional context
My setup is as follows: EC2 g4dn.large (4 vCPU, 16 GB RAM, NVIDIA T4). While deploying, the model consumes around 10 GB of RAM. Should I consider using instances with more RAM?

Tabrizian commented 1 year ago

Hi @oleks-popovych, is Triton using more RAM than your model would use outside Triton? What is the backend that you are using?

oleks-popovych commented 1 year ago

Hi @Tabrizian, thank you for replying.

Yes, Triton consumes a lot more compared to our previous solution, which executes TensorFlow code in our custom-made pipelines.

The backend is TensorFlow SavedModel.

A bit more context: the packed model is around 600 MB. VRAM usage is approximately 3-3.5 GB. The input image size is 512x350x3, uint8. Max batch size is 2, but tweaking it has little or no effect. I'm also applying the auto mixed precision optimization.

I'm using tritonserver 22.10.

I'm quite certain there are no memory leaks, since memory doesn't grow; after warm-up it stays the same. But adding more models becomes problematic.
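
For reference, a sketch of how auto mixed precision is commonly enabled for the TensorFlow backend in a model's config.pbtxt; this is an assumption about the setup, the actual config used here is not shown in the thread:

optimization {
  execution_accelerators {
    gpu_execution_accelerator: [ { name: "auto_mixed_precision" } ]
  }
}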

Tabrizian commented 1 year ago

@tanmayv25 Any ideas what could be the issue here?

tanmayv25 commented 1 year ago

As per my understanding, Triton by itself would not consume this much memory. @oleks-popovych Do you see such high memory consumption just after loading the model? Or does the memory consumption increase when you send inference requests? Can you share your model (or any dummy model) that demonstrates the difference in memory consumption within/outside Triton?

thran commented 1 year ago

I have a similar problem. I have a TensorFlow model (250 MB saved on disk).

My observations on (CPU) RAM usage:

- no GPU
- with 1 GPU
- with 2 GPUs

GPU RAM usage is 1.6 GB.

I don't understand why so much RAM is even needed; the inputs are single normal-size images. We have had a similar experience with tensorflow-serving. I can try some dummy model later. Is this expected behaviour? Is it avoidable?

tanmayv25 commented 1 year ago

@thran The TensorFlow session allocates memory pools for caching purposes. I suspect that is what is causing the memory consumption to increase. I wonder if there are any hooks that TensorFlow exposes for us to restrict or tweak CPU memory allocations. For limiting GPU memory, you can use gpu-memory-fraction.
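
A minimal sketch of how that option is typically passed when launching the server; the 0.4 value and the /models path are placeholders, not recommendations:

tritonserver --model-repository=/models --backend-config=tensorflow,gpu-memory-fraction=0.4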

oleks-popovych commented 1 year ago

@tanmayv25 It's funny, but I'm actually okay with the GPU memory consumption.

thran commented 1 year ago

> @tanmayv25 It's funny, but I'm actually okay with the GPU memory consumption.

Same here. I have 13 GB of unused memory on the GPU, but CPU RAM is full.

tanmayv25 commented 1 year ago

@oleks-popovych

> Yes, Triton consumes a lot more compared to our previous solution, which executes TensorFlow code in our custom-made pipelines.

Can you describe how your custom-made pipelines were loading the TF model and serving it? Triton's TF backend loads and serves models in a way similar to tensorflow-serving. If your custom solution was using less RAM, then there is scope for improvement.

A couple of questions to understand the difference:

  1. Can you take a look at the following backend parameters and compare them with your custom solution?
  2. For example, can you set TF_NUM_INTRA_THREADS/TF_NUM_INTER_THREADS to 1 and see if it improves the memory consumption?
  3. Are you using the same TF_GRAPH_TAG and TF_SIGNATURE_DEF while serving your models?
  4. What happens when you disable the auto mixed precision optimization? Does it reduce the memory footprint?

So far I have not been able to find any TF hooks to reduce the memory footprint, but comparing your custom solution with Triton's might give us some hints.
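
For reference, a sketch of how these TF backend parameters can be set in a model's config.pbtxt; the values below are only illustrative, and "serve"/"serving_default" are assumed to be the SavedModel's tag and signature:

parameters: { key: "TF_NUM_INTRA_THREADS" value: { string_value: "1" } }
parameters: { key: "TF_NUM_INTER_THREADS" value: { string_value: "1" } }
parameters: { key: "TF_GRAPH_TAG" value: { string_value: "serve" } }
parameters: { key: "TF_SIGNATURE_DEF" value: { string_value: "serving_default" } }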

oleks-popovych commented 1 year ago

@tanmayv25 The following table describes my experiments based on your suggestions:

| Experiment | RAM consumption |
| --- | --- |
| nothing launched | 412 MB |
| no auto mixed precision | 10.9 GB |
| model with auto mixed precision | 11.8 GB |
| auto mixed precision + both NUM_INTRA/INTER_THREADS = 1 | 11.9 GB |

My custom solution is as follows:

I had a model trained on TF1. I've switched to TF2, but I'm using the model via the TF1 compatibility layer embedded in TF2. The model is loaded using tf.compat.v1.import_graph_def, then a tf.Session is created as follows:

import tensorflow as tf

# `graph` is a tf.Graph populated via tf.compat.v1.import_graph_def (see above).
device_count = {"GPU": 1}
per_process_gpu_memory_fraction = 0.3
sess_config = tf.compat.v1.ConfigProto(
    allow_soft_placement=True,
    gpu_options=tf.compat.v1.GPUOptions(
        per_process_gpu_memory_fraction=per_process_gpu_memory_fraction
    ),
    device_count=device_count,
)

session = tf.compat.v1.Session(graph=graph, config=sess_config)

Then it is later used like this:

feed = {self.graph_input_tensor: input_batch}
output = self.list_output_tensors

results = self.session.run(output, feed_dict=feed)

With regard to TF_GRAPH_TAG and TF_SIGNATURE_DEF, I'm not using those signatures in my custom solution. But when preparing the model for Triton, I'm using the default signatures from this tutorial.
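
For context, a rough sketch of how such a TF1-style graph can be exported as a SavedModel with the default "serve" tag and "serving_default" signature; the tensor names and export path are placeholders, not the ones actually used here:

import tensorflow as tf

# `graph` is the tf.Graph built via tf.compat.v1.import_graph_def as above.
# Tensor names and the export directory are illustrative placeholders.
with tf.compat.v1.Session(graph=graph) as sess:
    tf.compat.v1.saved_model.simple_save(
        sess,
        export_dir="model_repository/my_model/1/model.savedmodel",
        inputs={"input": graph.get_tensor_by_name("input:0")},
        outputs={"output": graph.get_tensor_by_name("output:0")},
    )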

To summarize:

  1. My pipeline is based on importing the graph definition and later using it for inference.
  2. It seems that I'm using the default values provided by TF2 for these parameters, and I'm controlling CLI options like --backend-config=tensorflow,allow-soft-placement and --backend-config=tensorflow,gpu-memory-fraction.

tanmayv25 commented 1 year ago

@oleks-popovych Thanks for sharing these results. Can you also include the RAM consumption of your custom solution in the table? One more question: I just want to confirm that you are comparing identical TF versions between Triton and your custom solution.

Arashi19901001 commented 1 year ago

Same problem here. https://github.com/triton-inference-server/server/issues/5392