MatthieuToulemont opened 9 months ago
Hi @MatthieuToulemont, have you tried specifying `parameters` for the TRT model in the `config.pbtxt`?
I am not familiar with the intricacies of your model though. If you could possibly provide an illustrative example of what you mean, it would be easier for us to find a solution or implement this feature.
Hello, basically I am wondering if, taking the config you shared as an example, it is possible to use the parameter `execute_delay_ms` as an input to a TensorRT model through the input maps of an ensemble model. For instance:
input {
  name: "input_0"
}
output {
  name: "output_0"
}
ensemble_scheduling {
  step {
    model_name: "MODEL_1"
    model_version: -1
    input_map {
      key: "input_0"
      value: "input_0"
    }
    input_map {
      key: "input_1"
      value: "input_1"
    }
    output_map {
      key: "output_0"
      value: "output_0"
    }
  }
}
parameters {
  key: "input_1"
  value {
    string_value: "5"
  }
}
In the example above, will the value of `input_1` defined in the `parameters` block be forwarded to `MODEL_1` through the input map?
Would you possibly consider a BLS approach instead of an ensemble? This is definitely possible in BLS.
The issue with BLS models (per my understanding) is that we don't benefit from the ensemble scheduling.
As soon as I put a BLS model in charge of my ensemble, I will need to wait for one request to be fully processed before another can start being processed, right?
This is true when you use the `inference_request.exec` function, which executes blocking inference requests. You can also explore `inference_request.async_exec`, which allows you to perform async inference requests. This can be useful when you do not need the result of the inference immediately. Using the `async_exec` function, it is possible to have multiple in-flight inference requests and wait for the responses only when needed. Our docs show examples for both the blocking and non-blocking cases.
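For reference, here is a minimal sketch of the non-blocking pattern in a Python BLS model, modeled on the Python backend's BLS examples; the model and tensor names (`MODEL_1`, `input_0`, `output_0`) and the request count are placeholders:

```python
import asyncio

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    async def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "input_0")

            # Launch several inference requests without blocking on any of them.
            infer_awaits = []
            for _ in range(4):
                infer_request = pb_utils.InferenceRequest(
                    model_name="MODEL_1",
                    requested_output_names=["output_0"],
                    inputs=[input_tensor])
                # async_exec() returns an awaitable; exec() would block here.
                infer_awaits.append(infer_request.async_exec())

            # Wait for the in-flight responses only when they are needed.
            infer_responses = await asyncio.gather(*infer_awaits)

            output = pb_utils.get_output_tensor_by_name(
                infer_responses[0], "output_0")
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[output]))
        return responses
```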
I wish to have this feature too: directly declaring constants in `config.pbtxt`. I've been using the DALI backend, and a few projects share the same pipeline, differing only in several parameters that remain unchanged once the service has started. Right now I have to rebuild the parameters into the pipeline for every project, but with this feature the pipeline could be configured through the pbtxt.
We have opened a ticket to look into this enhancement.
ref: 6179
Related issue: https://github.com/triton-inference-server/server/issues/6561
Hello,
I am using a lot of ensemble models in production, and the biggest pain point I have is that in TensorRT it is impossible to index tensors when the index is an input.
Hence, to bypass this, I have to create a one-hot tensor, multiply, and sum to fetch the value I want.
This is very slow. For instance, on a diffusion loop of 20 steps with DPM++, you lose 10 ms per request on the full diffusion pipeline compared to having the parameters available without the one-hot, multiply, and sum. This is huge and not shippable for us.
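To make the workaround concrete, here is a numpy illustration of the one-hot/multiply/sum pattern; it only mirrors the logic of the TensorRT graph, it is not the graph itself, and the shapes are made up:

```python
import numpy as np


def gather_via_one_hot(table, idx):
    """Fetch table[idx] without direct indexing, for graphs where
    indexing with a runtime input is not supported.

    table: (num_steps, ...) per-step constants (e.g. scheduler values)
    idx:   step index arriving as a runtime input
    """
    num_steps = table.shape[0]
    one_hot = np.eye(num_steps, dtype=table.dtype)[idx]  # (num_steps,)
    # Multiply-and-reduce over the step axis stands in for table[idx].
    return np.tensordot(one_hot, table, axes=([0], [0]))


table = np.random.rand(20, 4).astype(np.float32)  # 20 diffusion steps
assert np.allclose(gather_via_one_hot(table, 7), table[7])
```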
The simplest solution I have found so far is to have a model generate those values, cache the response, and forward it to the diffusion-loop ensemble model. But this is very convoluted and not very elegant.
Taking the diffusion loop with DPM++ as the example, it would be good to be able to define the parameters needed to run the diffusion and scheduler steps directly in the ensemble config, along the lines of the sketch below.
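Something like this (hypothetical syntax, since this is exactly the feature being requested; `scheduler_step` and `sigmas` are made-up names):

```
ensemble_scheduling {
  step {
    model_name: "scheduler_step"
    model_version: -1
    input_map {
      key: "sigmas"
      # Hypothetical: resolved from the parameters block below
      # instead of from a tensor produced by another step.
      value: "sigmas"
    }
  }
}
parameters {
  key: "sigmas"
  value {
    string_value: "[14.61, 11.03, 8.45, ...]"
  }
}
```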
What do you think of adding this as a feature?