briedel opened this issue 2 months ago
Facing the same issue with 24.01 and 24.08.
Try using jemalloc instead of malloc; jemalloc returns allocated memory to the system more quickly.
Got the same problem when using onnxruntime
Same here, is there any news? Using the ONNX backend and a bge-m3 onnx model, I fill up the GPU memory and it is never released...
Hi, can you please add these lines at the end of the config.pbtxt in Triton, restart the server, and see whether this works?
parameters { key: "enable_mem_arena" value: { string_value: "1" } }
parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "gpu:0" } }
I might be wrong about the syntax here, but basically you need to turn on shrinkage for the CUDA memory arena.
Please let me know if this works; otherwise I will test it myself tomorrow and let you know the correct syntax that worked for me in the past.
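If it helps, this is where the two parameters would go: appended to the end of the ONNX model's own config.pbtxt (a sketch reusing the syntax suggested above; the "gpu:0" value names the CUDA device whose arena should be shrunk, so adjust it if your instance runs on a different GPU):
# appended at the very end of the onnx model's config.pbtxt
parameters { key: "enable_mem_arena" value: { string_value: "1" } } # keep the ORT memory arena enabled
parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "gpu:0" } } # ask ORT to shrink the CUDA arena on device 0 back down after inference
With shrinkage on, the arena is trimmed back after inference instead of holding on to its peak allocation.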
Let me try it right now!
OK, this works amazingly well! Thank you so much.
Quick questions:
As I understand it, this was not a bug; the ORT engine just keeps the memory allocated and allocates inside it. Is it better to keep it that way, or is it better with your params turned on?
I have a config with an ensemble model:
name: "ensemble_bge-m3"
# max_batch_size: 8 # maximum batch size
platform: "ensemble" # ensemble model architecture is used to chain tokenization with model inference
input [
{
name: "TEXT"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
output [
{
name: "dense_embeddings"
data_type: TYPE_FP16
dims: [ -1, 1024 ]
# 3 dimensional tensor, where 1st dimension: batch-size
},
{
name: "colbert_embeddings"
data_type: TYPE_FP16
dims: [ -1, -1, 1024 ]
# 4 dimensional tensor, where 1st dimension: batch-size
}
]
ensemble_scheduling { step [ # each step corresponds to a step in the ensemble pipeline
{
model_name: "tokenizer_bge-m3" # must match with the model_name defined in the tokenizer config
model_version: -1 # -1 means use latest version
input_map {
key: "TEXT" # key is the name of the input of the ensemble model
value: "TEXT" # value is the name of the input that we are setting to the current step (in the model.py for the tokenizer, we will glob for this input name)
}
output_map [
{
key: "input_ids" # key is the name of the tensor output of the tokenizer
value: "input_ids" # value is the name of the tensor output we are setting for the tokenizer
},
{
key: "attention_mask"
value: "attention_mask"
}
]
},
# Model Inference step
{
model_name: "bge-m3" # must match with the model_name defined in the bge-m3 model config
model_version: -1 # -1 means use latest version
input_map [
# these input maps map the inputs of this model to the ensemble tensors produced by the previous step
{
key: "input_ids" # key is the name of the output of the previous step
value: "input_ids" # value is the name of the input of the current step
},
{
key: "attention_mask"
value: "attention_mask"
}
]
output_map [
# these output maps map the outputs of this model to the ensemble-scope output tensors
{
key: "sentence_embedding" # key is the name of the output of the current model
value: "dense_embeddings" # value is the name of the output of the current step
},
{
key: "token_embeddings" # key is the name of the output of the current model
value: "colbert_embeddings" # value is the name of the output of the current step
}
]
}
]
}
and the bge-m3 model config:
name: "bge-m3" platform: "onnxruntime_onnx" backend: "onnxruntime" default_model_filename: "model_fp16.onnx"
max_batch_size: 12 input [ { name: "input_ids" # must match with ensemble model inputting mapping data_type: TYPE_INT64 # is int_64 for onnx dims: [ -1 ] # we leave it as -1 to indicate that seq_len is dynamic. Note that we do not specify batch dimension. }, { name: "attention_mask" # must match with ensemble model inputting mapping data_type: TYPE_INT64 # is int_64 for onnx dims: [ -1 ] # we leave it as -1 to indicate that seq_len is dynamic. Note that we do not specify batch dimension. } ] output [ { name: "sentence_embedding" # must match with the names of the output tensors defined in the onnx model file data_type: TYPE_FP16 # must match with the data type of the output tensor defined in the onnx model file dims: [ 1024 ] # must match with the dimension of the output tensor defined in the onnx model file }, { name: "token_embeddings" # colbert vector output from BGE-M3 model data_type: TYPE_FP16 dims: [ -1, 1024 ] # colbert will generate a vector per token with 1024 dimensions each for each batch element } ]
instance_group [ { count: 1 # number of instances of the model to run on the Triton inference server kind: KIND_GPU # or KIND_CPU, corresponds to whether model is to be run on GPU or CPU } ] dynamic_batching { max_queue_delay_microseconds: 10000 }
parameters { key: "enable_mem_arena" value: { string_value: "1" } } parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "gpu:0" } }
name: "tokenizer_bge-m3" backend: "python" # set to python because we are loading and pre-processing the input for tokenizers in python
max_batch_size: 12 # must match the max_batch_size of the ensemble model config input [
{ name: "TEXT" data_type: TYPE_STRING dims: [ -1 ] } ]
output [
{ name: "input_ids" data_type: TYPE_INT64 # this the datatype required by onnx models dims: [ -1 ] # -1 means the seq_len can be variable. Total number of dimensions must be 1 (we do not specify batch
}, { name: "attention_mask" data_type: TYPE_INT64 # this the datatype required by onnx models dims: [ -1 ] # -1 means the seq_len can be variable. Total number of dimensions must be 1 (we do not specify batch
} ]
instance_group [ { count: 1 # number of instances of tokenizer kind: KIND_GPU # or KIND_CPU, corrresponding to where the tokenizer is to be run on } ]
Am I doing something wrong with batching that could lead to such a memory overload? I tried sending a batch of 8 and it somehow crashed, yet if I go up to 9 and then back to 8 it works...
As I understand it, this was not a bug; the ORT engine just keeps the memory allocated and allocates inside it. Is it better to keep it that way, or is it better with your params turned on?
Yes, "it's a feature not a bug" 😅
It depends on what you are trying to do. If I am on a production cluster, I already know that I am going to use the nodes "only" for deploying my model, so I would keep the settings as they are.
If I am on a non-production machine where I need to constantly deploy more than one model at a time, then it would be "very" bad if a single model took up that much CUDA memory and never released it! I would probably use the config in that case.
Am I doing something wrong with batching that could lead to such a memory overload? I tried sending a batch of 8 and it somehow crashed, yet if I go up to 9 and then back to 8 it works...
Yes, it's expected. Here is the reason as I understand it:
Suppose you need to infer a batch of 8 and that requires 4 GB of CUDA memory, so ORT allocates that amount. Now suppose we did not turn on shrinkage: ORT holds on to that memory and does NOT release it. If you then need to infer a batch of 9, which requires 4.2 GB, ORT will allocate an "additional" 4.2 GB of CUDA memory (even though 4 GB is already allocated, it will not reuse it).
But if you have already inferred a batch of 9, i.e. 4.2 GB is already allocated, and you then pass a batch of 8, which needs less than what was previously allocated, ORT will "reuse" the already-allocated memory.
That's why you are getting this error. I don't know exactly why it works this way, but presumably it is for better and more efficient CUDA memory allocation.
I took a quick look at your config files, and since you are using the default settings you are actually allocating more memory than you need (the arena grows in powers of 2). Can you also add this line to the end of the config file and restart the server?
parameters { key: "arena_extend_strategy" value: { string_value: "1" } }
This should drastically reduce the amount of memory allocated for each of your infer calls, and you should be able to use a batch size of 50+ (considering 8 was your max before).
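To put the pieces together, the end of the ONNX model's config.pbtxt would then look roughly like this (a sketch; in ONNX Runtime, "0" is the default kNextPowerOfTwo strategy and "1" is kSameAsRequested):
# grow the CUDA arena by exactly what a request needs ("1" = kSameAsRequested)
# instead of the default power-of-2 chunks ("0" = kNextPowerOfTwo)
parameters { key: "arena_extend_strategy" value: { string_value: "1" } }
# optionally keep the shrinkage settings suggested earlier in the thread
parameters { key: "enable_mem_arena" value: { string_value: "1" } }
parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "gpu:0" } }
Whether to keep the shrinkage lines depends on the latency trade-off discussed below.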
Hello!
Just tried it out! It works perfectly. I can't get to a batch size of 80, but I went from 8 to 24!
If I understand correctly, it will now allocate exactly what is requested rather than the next power of 2. How did you find out about this param? Are there other params like this (basic need-to-know) that I should know about? Is it OK to use it in a prod env?
We're in a dev env for now but going to production next week; I'll change the two enable_mem_arena params then. If I understood your explanation correctly, without them the memory won't be freed, but on a new batch ORT will internally decide to overwrite the previous batch's memory with its own optimized algorithm.
Anyway, thank you so much for your help, it means a lot to me!
Hi, I am glad that it worked!
The way I would configure my prod environment (considering the max batch size for your GPU is 24) is:
parameters { key: "arena_extend_strategy" value: { string_value: "1" } }
and keep the other configs at their defaults. I would keep memory shrinkage off, because a separate memory allocation for each request would increase the latency of the service. Since I would send a payload of the max batch size right after the server starts, it allocates the maximum possible CUDA memory up front, and every later request will be smaller than that max batch and will reuse the already-allocated memory, which keeps latency low from that point on. Also, since the client-side code is restricted, it will never send a payload of more than 24 examples, so it will never throw an OOM error.
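One way to send that max-batch payload automatically at startup is Triton's model_warmup section in config.pbtxt. A minimal sketch, assuming the tensor names from the configs above and an arbitrary sequence length of 512:
model_warmup [
  {
    name: "max_batch_warmup"
    batch_size: 24 # the largest batch the clients are allowed to send (must not exceed max_batch_size)
    inputs {
      key: "input_ids"
      value: {
        data_type: TYPE_INT64
        dims: [ 512 ] # assumed max sequence length; pick a realistic value for your workload
        zero_data: true # dummy zero tokens are enough to trigger the allocation
      }
    }
    inputs {
      key: "attention_mask"
      value: {
        data_type: TYPE_INT64
        dims: [ 512 ]
        zero_data: true
      }
    }
  }
]
This grows the arena to its peak size before the first real client request arrives.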
If I understood your explanation correctly, without them the memory won't be freed, but on a new batch ORT will internally decide to overwrite the previous batch's memory with its own optimized algorithm.
Only if the new batch is less than or equal to the previous batch size.
How did you find out about this param? Are there other params like this (basic need-to-know) that I should know about? Is it OK to use it in a prod env?
Yes, it's OK to use it in a production environment. You can find more about the params here.
Thank you so much! Wishing you a very nice week.
Description
Triton does not clear or release GPU memory when there is a pause in inference. In the attached diagrams the same model is being used. It is served via ONNX.
The GPU memory keeps growing when I do not use the TensorRT optimizations to limit the workspace memory. With the TensorRT optimization the memory use is limited, but not fully: when I limit the workspace to 7 GB, 11 GB of GPU memory are used, although that usage stays constant.
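For reference, a workspace limit like the one described is typically configured through the TensorRT execution accelerator in the model's config.pbtxt; a sketch with assumed values (the 7 GB limit mirrors the description above):
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "7516192768" } # 7 GB workspace limit
      }
    ]
  }
}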
My model config is:
Triton Information: 24.06-py3
Are you using the Triton container or did you build it yourself?
Container
To Reproduce
Expected behavior
GPU memory is released, or at least stops growing, when there is a pause in inference.