michaelnny opened 5 months ago
I have the exact same problem trying to propagate the input to the postprocessing model in the ensemble pipeline, and I am getting the same error. Have you found a solution yet?
@adrtsang no solution or workaround; I've moved on and am using vLLM now, which seems to have much better community support.
I am using an ensemble pipeline that serves a TensorRT model. The input to the pre-processing model is an image volume, and this needs to also be propagated to the post-processing model. Both the pre- and post-processing models have decoupled = True for the model_transaction_policy. Here's the config.pbtxt for the ensemble pipeline:
name: "ensemble_model"
platform: "ensemble"
max_batch_size: 0
input [
{
name: "input_image"
data_type: TYPE_FP16
dims: [ -1, -1, -1 ]
}
]
output [
{
name: "postprocessed_image"
data_type: TYPE_FP16
dims: [ -1, -1, -1 ]
}
]
ensemble_scheduling {
step [
{
model_name: "whs_preprocess_model"
model_version: -1
input_map {
key: "whs_preprocess_model_input_image"
value: "input_image"
}
output_map {
key: "whs_preprocess_model_patch"
value: "whs_image_patch"
}
output_map {
key: "whs_preprocess_model_pad_dimension"
value: "whs_res_arr_pad_dimension"
}
output_map {
key: "whs_preprocess_model_padding"
value: "whs_slicer_to_padding"
}
output_map {
key: "whs_preprocess_model_slicer"
value: "whs_slicer"
}
output_map {
key: "whs_preprocess_model_background"
value: "whs_background_indices"
}
},
{
model_name: "whs_model"
model_version: -1
input_map {
key: "input"
value: "whs_image_patch"
}
output_map {
key: "output"
value: "whs_model_prediction"
}
},
{
model_name: "whs_postprocess_model"
model_version: -1
input_map {
key: "whs_postprocess_input_prediction"
value: "whs_model_prediction"
}
input_map {
key: "whs_postprocess_model_pad_dimension"
value: "whs_pad_dimension"
}
input_map {
key: "whs_postprocess_model_padding"
value: "whs_slicer_to_padding"
}
input_map {
key: "whs_postprocess_model_slicer"
value: "whs_slicer"
}
input_map {
key: "whs_postprocess_model_background"
value: "whs_background_indices"
}
input_map {
key: "whs_postprocess_input_image"
value: "input_image"
}
output_map {
key: "whs_postprocessed_output"
value: "postprocessed_image"
}
}
]
}
```
When loading this ensemble pipeline in the Triton server (nvcr.io/nvidia/tritonserver:24.07-py3), it gives an error:
E0903 17:36:36.480489 1 model_repository_manager.cc:614] "Invalid argument: in ensemble ensemble_model, step of model 'ensemble_model' receives inputs originated from different decoupled models"
How can I propagate the input image to both pre- and post-processing models?
If I remember correctly, it does not work if we enable decoupled mode in the ensemble model pipeline; I'm not sure if this is by design or something else.
You can try disabling it and see if it works, but doing so will also disable "streaming".
I had the same `receives inputs originated from different decoupled models` issue and was able to resolve it. In my case, the issue was that I had used the same key name as an input in different models, which doesn't work correctly with decoupled models.
Here is a simplified config.pbtxt example that was causing this error for me. Note that `INPUT_JSON` comes from the request:
name: "minimal_example"
platform: "ensemble"
input [
{
name: "INPUT_JSON"
data_type: TYPE_STRING
dims: [ 1 ]
}
]
output [
{
name: "OUTPUT_IDS"
data_type: TYPE_INT16
dims: [ -1 ]
}
]
ensemble_scheduling {
step [
# Not decoupled
{
model_name: "fetch_signals"
model_version: 1
input_map {
key: "SIGNAL_FETCH_INPUT_JSON"
value: "INPUT_JSON" # Input request JSON
}
output_map {
key: "SIGNAL_FETCH_OUTPUT_SIGNAL"
value: "SIGNAL_FETCH_OUTPUT_SIGNAL"
}
},
# "inference" stage is decoupled
{
model_name: "inference"
model_version: 1
input_map {
key: "INFERENCE_INPUT_SIGNAL"
value: "SIGNAL_FETCH_OUTPUT_SIGNAL"
}
input_map {
key: "INFERENCE_INPUT_JSON"
value: "INPUT_JSON" # Input request JSON
}
output_map {
key: "INFERENCE_OUTPUT"
value: "INFERENCE_OUTPUT"
}
},
# Not decoupled
{
model_name: "last_stage"
model_version: 1
input_map {
key: "LAST_STAGE_INPUT"
value: "INFERENCE_OUTPUT"
}
input_map {
key: "LAST_STAGE_INPUT_JSON"
value: "INPUT_JSON" # Input request JSON
}
output_map {
key: "LAST_STAGE_OUTPUT_IDS"
value: "OUTPUT_IDS"
}
}
]
}
```
The issue here is that the `last_stage` takes both `INFERENCE_OUTPUT` and `INPUT_JSON` as inputs. This works fine in non-decoupled mode as we can just pass along `INPUT_JSON` from the overall ensemble input.
However, in decoupled mode, it seems like `last_stage` (not itself decoupled, but following the decoupled `inference` stage) needs all of its inputs to be provided by the previous decoupled `inference` stage. Otherwise, `last_stage` receives its inputs from different sources: the decoupled model that provides `INFERENCE_OUTPUT`, and the ensemble request itself for `INPUT_JSON`.
This seems like a bug in Triton Inference Server. **I worked around this by essentially having the second decoupled `inference` stage output all the tensors needed by the next non-decoupled stage.** So the modified config.pbtxt makes the following changes:
```protobuf
# Other fields unchanged
ensemble_scheduling {
# First stage "fetch_signals" is unchanged
{...}
# Decoupled "inference" stage re-outputs request JSON
{
model_name: "inference"
# All fields unchanged but add:
output_map {
# inference stage model.py will output "INFERENCE_INPUT_JSON"
# which will be the same value as "INPUT_JSON"
key: "INFERENCE_INPUT_JSON"
# Use a different output name than "INPUT_JSON" to fix the bug
value: "INFERENCE_REQUEST_INPUT_JSON"
}
},
# Not decoupled "last_stage" is changed to modify the input:
{
model_name: "last_stage"
# All other fields unchanged but modify:
input_map {
key: "LAST_STAGE_INPUT_JSON"
value: "INFERENCE_REQUEST_INPUT_JSON" # Modified name
}
}
]
}
```
Essentially, the fix is for the decoupled stage to re-output `INPUT_JSON` as `INFERENCE_INPUT_JSON`, which the next stage then consumes as `INFERENCE_REQUEST_INPUT_JSON` (renamed to `LAST_STAGE_INPUT_JSON` via its input_map).
This fix ensures all the inputs to `last_stage` are produced by the previous decoupled `inference` stage and can be consumed at the same time.
In addition to the ensemble config.pbtxt change above, I also had to change the config.pbtxt for the `inference` stage to output `INFERENCE_INPUT_JSON` as well, and change its model.py code to fetch the input tensor and re-output it as `INFERENCE_INPUT_JSON`.
Hope this helps you demystify the cause of this bug in your case.
Hi @lakshbhasin, I was able to implement the workaround above and got further in my ensemble pipeline. Thank you! However, I now run into an issue where the outputs from the pre-processing model are not passed properly to the next stage (inference) in the pipeline. I get the following error when I run my client code:
tritonclient.utils.InferenceServerException: in ensemble 'whs_ensemble_model', [request id: 1] input byte size mismatch for input 'whs_inference_input_background' for model 'whs_inference_model'. Expected 6658560, got 0
I return the outputs in the pre-processing model.py script using `inference_response = pb_utils.InferenceResponse(output_tensors=[output1, output2, output3])`.
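For context, the relevant part of the pre-processing model.py looks roughly like this (heavily simplified; the tensor names, dtypes, and the preprocessing itself are placeholders, and since the model is configured as decoupled the response goes through the request's response sender):

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Heavily simplified sketch of the pre-processing model; the tensor
    names, dtypes, and preprocessing logic are placeholders."""

    def execute(self, requests):
        for request in requests:
            # The model is decoupled, so responses go through the request's
            # response sender rather than being returned from execute().
            sender = request.get_response_sender()

            image = pb_utils.get_input_tensor_by_name(
                request, "whs_preprocess_model_input_image").as_numpy()

            # Placeholder preprocessing results.
            patch = image.astype(np.float16)
            pad_dimension = np.zeros((3,), dtype=np.int32)
            background = np.zeros((10,), dtype=np.int64)

            output1 = pb_utils.Tensor("whs_preprocess_model_patch", patch)
            output2 = pb_utils.Tensor("whs_preprocess_model_pad_dimension",
                                      pad_dimension)
            output3 = pb_utils.Tensor("whs_preprocess_model_background",
                                      background)

            inference_response = pb_utils.InferenceResponse(
                output_tensors=[output1, output2, output3])
            sender.send(inference_response,
                        flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        return None
```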
Any idea why zero bytes of input data are passed to the next stage in the pipeline, causing this exception?
Description
In an ensemble pipeline for the TensorRT-LLM backend, when we try to propagate data from the preprocessing model to the postprocessing model, we get the error `Model 'ensemble' receives inputs originated from different decoupled models`.
Here's a summary of the problem:
Triton Information
NVIDIA Release 24.04 (build 90085495), Triton Server Version 2.45.0
Using the Triton image in a container
To Reproduce
Step 1: Enable decoupled mode inside `tensorrt_llm\config.pbtxt`.
Step 2: Add a new input field to the `postprocessing` model inside `postprocessing\config.pbtxt`.
Step 3: Try to inherit data from the `preprocessing` model/step inside `ensemble\config.pbtxt`.
Step 4: Start the Triton server; we get the following error, which causes the server to shut down.
Expected behavior
Should be able to propagate data within the ensemble pipeline without needing to disable decoupled mode, since we need to use the streaming feature.