Closed: haiminh2001 closed this issue 7 months ago
In the configuration, deepfake-infer-x (1 to 4) are defined one after another in the ensemble schedule. So the scheduler considers these steps to be executed in sequence, as an ensemble is just a pipeline containing the sequence of models to run. Triton has an option to run multiple instances of a model concurrently. There's a comprehensive guide here - https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_2-improving_resource_utilization
Please have a look at the guide and see if it solves your problem.
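For reference, the concurrency knob that guide covers is `instance_group` in a model's `config.pbtxt`. A minimal sketch (the count and kind are illustrative, not taken from this issue's configs):

```
instance_group [
  {
    count: 2        # run two execution instances of this model concurrently
    kind: KIND_GPU
  }
]
```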
Since deepfake-infer-x (1 - 4) are different models, setting multiple instances is not relevant. What I expect is that these models depend only on the preprocess model's output; therefore, as soon as that output is ready, the deepfake-infer-x models should be able to run their inference without waiting for any other deepfake-infer-x.
Here is the example from the official page:
When an inference request for the ensemble model is received, the ensemble scheduler will:

1. Recognize that the "IMAGE" tensor in the request is mapped to input "RAW_IMAGE" in the preprocess model.
2. Check models within the ensemble and send an internal request to the preprocess model because all the input tensors required are ready.
3. Recognize the completion of the internal request, collect the output tensor and map the content to "preprocessed_image", which is a unique name known within the ensemble.
4. Map the newly collected tensor to inputs of the models within the ensemble. In this case, the inputs of "classification_model" and "segmentation_model" will be mapped and marked as ready.
5. Check models that require the newly collected tensor and send internal requests to models whose inputs are ready, the classification model and the segmentation model in this case. Note that the responses will be in arbitrary order depending on the load and computation time of individual models.
6. Repeat steps 3-5 until no more internal requests should be sent, and then respond to the inference request with the tensors mapped to the ensemble output names.
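The fan-out in steps 4-5 can be sketched in an ensemble `config.pbtxt`; the model and tensor names below follow the documentation's example and are illustrative, not taken from this issue's actual config:

```
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map  { key: "RAW_IMAGE" value: "IMAGE" }
      output_map { key: "PREPROCESSED_OUTPUT" value: "preprocessed_image" }
    },
    {
      # Both steps below read only "preprocessed_image", so the scheduler
      # can dispatch them as soon as preprocess finishes.
      model_name: "classification_model"
      model_version: -1
      input_map  { key: "FORMATTED_IMAGE" value: "preprocessed_image" }
      output_map { key: "CLASSIFICATION_OUTPUT" value: "CLASSIFICATION" }
    },
    {
      model_name: "segmentation_model"
      model_version: -1
      input_map  { key: "FORMATTED_IMAGE" value: "preprocessed_image" }
      output_map { key: "SEGMENTATION_OUTPUT" value: "SEGMENTATION" }
    }
  ]
}
```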
In step 5, I believe that the responses being in arbitrary order implies concurrent execution, not sequential execution as you mentioned.
In conclusion, I think the expectation in my issue matches this explanation of ensembles in the official documentation.
@haiminh2001 Let me know if you need more help.
Thank you for your support. Maybe I did not make it clear. In short, in contrast to your reading, I still expect the deepfake-infer-x models to run concurrently, since the official page's description of ensemble behavior says they should.
cc @GuanLuo
For ensemble, the "compute" stats report the accumulated value of the "compute" stats from the composing models. However, if you check the end-to-end request time (`nv_inference_request_duration_us`), you should see the ensemble time to be shorter, as an indicator of concurrent execution.
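A quick way to make this comparison is to scrape Triton's Prometheus metrics endpoint and compare the ensemble's accumulated request duration against the sum over its composing models. The metric name below is Triton's real `nv_inference_request_duration_us`; the sample values and the parsing helper are an illustrative sketch, not output from the actual deployment:

```python
# Sketch: decide "concurrent vs. sequential" from Triton Prometheus metrics.
# SAMPLE_METRICS mimics the text served at http://<host>:8002/metrics;
# the numbers are made up for illustration.
import re

SAMPLE_METRICS = """\
nv_inference_request_duration_us{model="deepfake-end-to-end",version="1"} 900000
nv_inference_request_duration_us{model="preprocess",version="1"} 100000
nv_inference_request_duration_us{model="deepfake-infer-1",version="1"} 400000
nv_inference_request_duration_us{model="deepfake-infer-2",version="1"} 400000
nv_inference_request_duration_us{model="deepfake-infer-3",version="1"} 400000
nv_inference_request_duration_us{model="deepfake-infer-4",version="1"} 400000
nv_inference_request_duration_us{model="postprocess",version="1"} 100000
"""

PATTERN = re.compile(
    r'nv_inference_request_duration_us\{model="([^"]+)"[^}]*\}\s+(\d+)'
)

def parse_durations(text):
    """Return {model_name: accumulated request duration in microseconds}."""
    return {m.group(1): int(m.group(2)) for m in PATTERN.finditer(text)}

durations = parse_durations(SAMPLE_METRICS)
ensemble = durations.pop("deepfake-end-to-end")
composing_sum = sum(durations.values())

# If the scheduler dispatches the four infer models concurrently, the
# ensemble's end-to-end time is well below the sum of its components.
print(f"ensemble: {ensemble} us, sum of components: {composing_sum} us")
print("concurrent" if ensemble < composing_sum else "sequential")
```

With these sample numbers the ensemble time (900 ms) is half the component sum (1.8 s), which is the signature of the four infer models overlapping.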
That was also my hypothesis, and I did check the actual request time; it turned out to also equal the total time of every component. But today is the weekend, so I cannot collect those stats. I will gather them and send them to you as soon as possible.
I checked the metric `nv_inference_request_duration_us` and it was shorter, as you expected. The issue can be closed now.
Thank you guys for your support. I suggest that the docs describe what these metrics indicate when it comes to ensemble models.
Description Ensemble models do not run concurrently.
Triton Information Triton Version: 24.01
Are you using the Triton container or did you build it yourself? I use a pre-built container from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags.
To Reproduce I have an ensemble named deepfake-end-to-end with 6 components: preprocess, deepfake-infer-x with x from 1 to 4, and postprocess. Each deepfake-infer-x consumes the same input, which is the output of preprocess and nothing else. Therefore I expect these models to run concurrently. But when I check the metrics using Grafana, I found that the total inference time of the ensemble equals the sum of the inference times of every component, which suggests the components may run sequentially.
This is the config.pbtxt file of the ensemble; preprocess and postprocess use the Python backend, and the deepfake-infer-x models are all TensorRT.
Expected behavior Ensemble components are able to run concurrently, which should make the total inference runtime approximate the runtime of the longest component.