briedel opened this issue 2 months ago
Facing the same issue with 24.01 and 24.08.
Try using jemalloc instead of malloc; jemalloc returns allocated memory to the system more quickly.
Got the same problem when using onnxruntime
Same here, is there any news? Using the ONNX backend and a bge-m3 onnx model, I fill up the GPU memory and it is never released...
Hi, can you please add these lines at the end of the config.pbtxt in Triton, restart the server, and see whether this works?
parameters { key: "enable_mem_arena" value: { string_value: "1" } }
parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "gpu:0" } }
I might be wrong about the syntax here, but basically you need to turn on shrinkage for the CUDA memory arena.
Please let me know if this works; otherwise I will test it myself tomorrow and let you know the correct syntax that worked for me in the past.
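If it helps, this is where the two parameters would go: appended to the end of the ONNX model's own config.pbtxt (a sketch reusing the syntax suggested above; the "gpu:0" value names the CUDA device whose arena should be shrunk, so adjust it if your instance runs on a different GPU):
# appended at the very end of the onnx model's config.pbtxt
parameters { key: "enable_mem_arena" value: { string_value: "1" } } # keep the ORT memory arena enabled
parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "gpu:0" } } # ask ORT to shrink the CUDA arena on device 0 back down after inference
With shrinkage on, the arena is trimmed back after inference instead of holding on to its peak allocation.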
Let me try it right now!
OK, this works amazingly well! Thank you so much.
Quick questions:
As I understand it, this was not a bug; the ORT engine just keeps the memory allocated and allocates inside it. Is it better to keep it that way, or is it better with your params turned on?
I have a config with an ensemble model:
name: "ensemble_bge-m3"
# max_batch_size: 8 # maximum batch size
platform: "ensemble" # ensemble model architecture is used to chain tokenization with model inference
input [
{
name: "TEXT"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
output [
{
name: "dense_embeddings"
data_type: TYPE_FP16
dims: [ -1, 1024 ]
# 3 dimensional tensor, where 1st dimension: batch-size
},
{
name: "colbert_embeddings"
data_type: TYPE_FP16
dims: [ -1, -1, 1024 ]
# 4 dimensional tensor, where 1st dimension: batch-size
}
]
ensemble_scheduling { step [ # each step corresponds to a step in the ensemble pipeline
{
model_name: "tokenizer_bge-m3" # must match with the model_name defined in the tokenizer config
model_version: -1 # -1 means use latest version
input_map {
key: "TEXT" # key is the name of the input of the ensemble model
value: "TEXT" # value is the name of the input that we are setting to the current step (in the model.py for the tokenizer, we will glob for this input name)
}
output_map [
{
key: "input_ids" # key is the name of the tensor output of the tokenizer
value: "input_ids" # value is the name of the tensor output we are setting for the tokenizer
},
{
key: "attention_mask"
value: "attention_mask"
}
]
},
# Model Inference step
{
model_name: "bge-m3" # must match with the model_name defined in the bge-m3 model config
model_version: -1 # -1 means use latest version
input_map [
# these input maps map the inputs of this model to the ensemble tensors produced by the previous step
{
key: "input_ids" # key is the name of the output of the previous step
value: "input_ids" # value is the name of the input of the current step
},
{
key: "attention_mask"
value: "attention_mask"
}
]
output_map [
# these output maps map the outputs of this model to the ensemble-scope output tensors
{
key: "sentence_embedding" # key is the name of the output of the current model
value: "dense_embeddings" # value is the name of the output of the current step
},
{
key: "token_embeddings" # key is the name of the output of the current model
value: "colbert_embeddings" # value is the name of the output of the current step
}
]
}
]
}
and the bge-m3 model config:
name: "bge-m3" platform: "onnxruntime_onnx" backend: "onnxruntime" default_model_filename: "model_fp16.onnx"
max_batch_size: 12 input [ { name: "input_ids" # must match with ensemble model inputting mapping data_type: TYPE_INT64 # is int_64 for onnx dims: [ -1 ] # we leave it as -1 to indicate that seq_len is dynamic. Note that we do not specify batch dimension. }, { name: "attention_mask" # must match with ensemble model inputting mapping data_type: TYPE_INT64 # is int_64 for onnx dims: [ -1 ] # we leave it as -1 to indicate that seq_len is dynamic. Note that we do not specify batch dimension. } ] output [ { name: "sentence_embedding" # must match with the names of the output tensors defined in the onnx model file data_type: TYPE_FP16 # must match with the data type of the output tensor defined in the onnx model file dims: [ 1024 ] # must match with the dimension of the output tensor defined in the onnx model file }, { name: "token_embeddings" # colbert vector output from BGE-M3 model data_type: TYPE_FP16 dims: [ -1, 1024 ] # colbert will generate a vector per token with 1024 dimensions each for each batch element } ]
instance_group [ { count: 1 # number of instances of the model to run on the Triton inference server kind: KIND_GPU # or KIND_CPU, corresponds to whether model is to be run on GPU or CPU } ] dynamic_batching { max_queue_delay_microseconds: 10000 }
parameters { key: "enable_mem_arena" value: { string_value: "1" } } parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "gpu:0" } }
name: "tokenizer_bge-m3" backend: "python" # set to python because we are loading and pre-processing the input for tokenizers in python
max_batch_size: 12 # must match the max_batch_size of the ensemble model config input [
{ name: "TEXT" data_type: TYPE_STRING dims: [ -1 ] } ]
output [
{ name: "input_ids" data_type: TYPE_INT64 # this the datatype required by onnx models dims: [ -1 ] # -1 means the seq_len can be variable. Total number of dimensions must be 1 (we do not specify batch
}, { name: "attention_mask" data_type: TYPE_INT64 # this the datatype required by onnx models dims: [ -1 ] # -1 means the seq_len can be variable. Total number of dimensions must be 1 (we do not specify batch
} ]
instance_group [ { count: 1 # number of instances of tokenizer kind: KIND_GPU # or KIND_CPU, corrresponding to where the tokenizer is to be run on } ]
Am I doing something wrong with batching that could lead to such a memory overload? I tried sending a batch of 8 and it somehow crashed, yet if I go up to 9 and then back to 8 it works...
As I understand it, this was not a bug; the ORT engine just keeps the memory allocated and allocates inside it. Is it better to keep it that way, or is it better with your params turned on?
Yes, "it's a feature not a bug" 😅
It depends on what you are trying to do. If I am on a production cluster, I already know that I am going to use the nodes "only" for deploying my model, so I would keep the settings as they are.
If I am on a non-production machine where I need to constantly deploy more than one model at a time, then it would be "very" bad if a single model took up that much CUDA memory and never released it! I would probably use the config in that case.
Am I doing something wrong with batching that could lead to such a memory overload? I tried sending a batch of 8 and it somehow crashed, yet if I go up to 9 and then back to 8 it works...
Yes, it's expected. Here is the reason as I understand it:
Suppose you need to infer a batch of 8 and that requires 4 GB of CUDA memory, so ORT allocates that amount. Now suppose we did not turn on shrinkage: ORT holds on to that memory and does NOT release it. If you then need to infer a batch of 9, which requires 4.2 GB, ORT will allocate an "additional" 4.2 GB of CUDA memory (even though 4 GB is already allocated, it will not reuse it).
But if you have already inferred a batch of 9, i.e. 4.2 GB is already allocated, and you then pass a batch of 8, which needs less than what was previously allocated, ORT will "reuse" the already-allocated memory.
That's why you are getting this error. I don't know exactly why it works this way, but presumably it is for better and more efficient CUDA memory allocation.
I took a quick look at your config files, and since you are using the default settings you are actually allocating more memory than you need (the arena grows in powers of 2). Can you also add this line to the end of the config file and restart the server?
parameters { key: "arena_extend_strategy" value: { string_value: "1" } }
This should drastically reduce the amount of memory allocated for each of your infer calls, and you should be able to use a batch size of 50+ (considering 8 was your max before).
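To put the pieces together, the end of the ONNX model's config.pbtxt would then look roughly like this (a sketch; in ONNX Runtime, "0" is the default kNextPowerOfTwo strategy and "1" is kSameAsRequested):
# grow the CUDA arena by exactly what a request needs ("1" = kSameAsRequested)
# instead of the default power-of-2 chunks ("0" = kNextPowerOfTwo)
parameters { key: "arena_extend_strategy" value: { string_value: "1" } }
# optionally keep the shrinkage settings suggested earlier in the thread
parameters { key: "enable_mem_arena" value: { string_value: "1" } }
parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "gpu:0" } }
Whether to keep the shrinkage lines depends on the latency trade-off discussed below.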
Hello!
Just tried it out! It works perfectly. I can't get to a batch size of 80, but I went from 8 to 24!
If I understand correctly, it will now allocate exactly what is requested rather than the next power of 2. How did you find out about this param? Are there other params like this (basic need-to-know) that I should know about? Is it OK to use it in a prod env?
We're in a dev env for now but going to production next week; I'll change the two enable_mem_arena params then. If I understood your explanation correctly, without them the memory won't be freed, but on a new batch ORT will internally decide to overwrite the previous batch's memory with its own optimized algorithm.
Anyway, thank you so much for your help, it means a lot to me!
Hi, I am glad that it worked!
The way I would configure my prod environment (considering the max batch size for your GPU is 24) is:
parameters { key: "arena_extend_strategy" value: { string_value: "1" } }
and keep the other configs at their defaults. I would keep memory shrinkage off, because a separate memory allocation for each request would increase the latency of the service. Since I would send a payload of the max batch size right after the server starts, it allocates the maximum possible CUDA memory up front, and every later request will be smaller than that max batch and will reuse the already-allocated memory, which keeps latency low from that point on. Also, since the client-side code is restricted, it will never send a payload of more than 24 examples, so it will never throw an OOM error.
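One way to send that max-batch payload automatically at startup is Triton's model_warmup section in config.pbtxt. A minimal sketch, assuming the tensor names from the configs above and an arbitrary sequence length of 512:
model_warmup [
  {
    name: "max_batch_warmup"
    batch_size: 24 # the largest batch the clients are allowed to send (must not exceed max_batch_size)
    inputs {
      key: "input_ids"
      value: {
        data_type: TYPE_INT64
        dims: [ 512 ] # assumed max sequence length; pick a realistic value for your workload
        zero_data: true # dummy zero tokens are enough to trigger the allocation
      }
    }
    inputs {
      key: "attention_mask"
      value: {
        data_type: TYPE_INT64
        dims: [ 512 ]
        zero_data: true
      }
    }
  }
]
This grows the arena to its peak size before the first real client request arrives.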
If I understood your explanation correctly, without them the memory won't be freed, but on a new batch ORT will internally decide to overwrite the previous batch's memory with its own optimized algorithm.
Only if the new batch is less than or equal to the previous batch size.
How did you find out about this param? Are there other params like this (basic need-to-know) that I should know about? Is it OK to use it in a prod env?
Yes, it's OK to use it in a production environment. You can find more about the params here.
Thank you so much! Wishing you a very nice week.
Description
Triton does not clear or release GPU memory when there is a pause in inference. In the attached diagrams the same model is being used. It is served via ONNX.
The GPU memory keeps growing when I do not use the TensorRT optimizations to limit the workspace memory. With the TensorRT optimization the memory use is limited, but not fully: when I limit the workspace to 7 GB, 11 GB of GPU memory are used, although that usage stays constant.
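For reference, a workspace limit like the one described is typically configured through the TensorRT execution accelerator in the model's config.pbtxt; a sketch with assumed values (the 7 GB limit mirrors the description above):
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "7516192768" } # 7 GB workspace limit
      }
    ]
  }
}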
My model config is:
Triton Information: 24.06-py3
Are you using the Triton container or did you build it yourself?
Container
To Reproduce
Expected behavior
GPU memory is released, or at least stops growing, when there is a pause in inference.