wxthu opened this issue 1 year ago
Hi @wxthu, can you share the model config.pbtxt? If a model only has one instance and dynamic batching is disabled, it could be executing sequentially.
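(For illustration only, not from this thread: a minimal sketch of what adding dynamic batching and a second execution instance to a model's config.pbtxt could look like. The model name is a placeholder and the batching values are arbitrary.)

name: "my_model"
backend: "tensorrt"
max_batch_size: 4
# allow the scheduler to group concurrent requests into one batch
dynamic_batching {
  preferred_batch_size: [ 2, 4 ]
  max_queue_delay_microseconds: 100
}
# run two copies of this model on the GPU so requests can overlap
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]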
Thanks. The following are my configs (one config.pbtxt for convnext and one for vgg19):
name: "convnext"
backend: "tensorrt"
max_batch_size: 4
input [
{
name: "begin"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [ 3, 224, 224 ]
}
]
output [
{
name: "output"
data_type: TYPE_FP32
dims: [ 1000 ]
label_filename: "convnext_labels.txt"
}
]
instance_group [
{
count: 1
kind: KIND_GPU
}
]
name: "vgg19"
backend: "tensorrt"
max_batch_size: 4
input [
{
name: "begin"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [ 3, 224, 224 ]
}
]
output [
{
name: "output"
data_type: TYPE_FP32
dims: [1000]
label_filename: "vgg_labels.txt"
}
]
instance_group [
{
count: 1
kind: KIND_GPU
}
]
Additionally, what I would like to understand is: how does dynamic batching relate to multiple requests for the same model versus requests across multiple different models? When there are simultaneous requests for four different models, how should I enable parallel inference for those four models?
Actually, I wonder whether parallel execution of different models is supported. If yes, how can I enable it; if not, why not? Thanks so much.
I think they are in parallel by default, since they are different models. Did you find it otherwise?
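(A sketch, not from the thread, of one way to check this: send asynchronous requests to both models at roughly the same time via the Python gRPC client and compare when each response comes back. It assumes the gRPC endpoint at localhost:8001 and the input/output names and shapes from the configs above.)

import time
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
data = np.random.rand(1, 3, 224, 224).astype(np.float32)

done = {}  # model name -> time its response arrived at the client

def make_callback(model_name):
    def callback(result, error):
        # Record when this model's response came back.
        done[model_name] = time.time()
        if error is not None:
            print(model_name, "failed:", error)
    return callback

start = time.time()
for model in ["convnext", "vgg19"]:
    inp = grpcclient.InferInput("begin", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    out = grpcclient.InferRequestedOutput("output")
    # async_infer returns immediately; the callback fires when the response arrives.
    client.async_infer(model_name=model, inputs=[inp], outputs=[out],
                       callback=make_callback(model))

# Wait for both callbacks, then report how long each model took.
while len(done) < 2:
    time.sleep(0.01)

for model, t in done.items():
    print(f"{model}: finished {t - start:.3f}s after both requests were sent")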
Fine. I found no multiple processes in Triton, and let me check whether there are multiple streams. By the way, I found that it takes much longer using the TritonClient async infer API than the sync API. I use the async inference API to simulate concurrent requests for different models, do you think that works? @kthui
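(One thing worth ruling out, sketched below under the assumption that the Python HTTP client is used: tritonclient.http's async_infer only overlaps requests up to the concurrency value given when the InferenceServerClient is created, and the default is 1, so async requests can end up serialized on a single connection and look slower than sync calls. The gRPC client does not have this setting.)

import numpy as np
import tritonclient.http as httpclient

# concurrency controls how many HTTP connections the client keeps open;
# with the default of 1, async_infer requests are effectively sent one at a time.
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=4)

data = np.random.rand(1, 3, 224, 224).astype(np.float32)
pending = []

for model in ["convnext", "vgg19"]:
    inp = httpclient.InferInput("begin", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    out = httpclient.InferRequestedOutput("output")
    # async_infer returns a handle immediately; the request proceeds in the background.
    pending.append((model, client.async_infer(model_name=model,
                                              inputs=[inp], outputs=[out])))

for model, handle in pending:
    result = handle.get_result()  # blocks until this particular request completes
    print(model, result.as_numpy("output").shape)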
Description
I am building a baseline for my engineering project. I want to send multiple requests to multiple models and enable parallel execution when different models receive requests simultaneously. But when I used the example script to do that, I found that no parallel execution happened and the latency of the async API was obviously longer than that of the sync API. Could you please give me some ideas? Thanks so much.
Following is my client script:
Triton Information
r23.07, built it myself, TensorRT backend