triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

[Bug] Model 'ensemble' receives inputs originated from different decoupled models #7275

Open michaelnny opened 5 months ago

michaelnny commented 5 months ago

Description: In an ensemble pipeline for the TensorRT-LLM backend, when we try to propagate data from the preprocessing model to the postprocessing model, we get this error: Model 'ensemble' receives inputs originated from different decoupled models

Here's a summary of the problem:

Triton Information: NVIDIA Release 24.04 (build 90085495), Triton Server Version 2.45.0

Using the Triton container image.

To Reproduce

Step 1: Enable decoupled mode inside tensorrt_llm/config.pbtxt:

name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 64

model_transaction_policy {
  decoupled: true
}

Step 2: Add a new input field to the postprocessing model inside postprocessing/config.pbtxt (a sketch of how the Python model might read this input follows the config below):

name: "postprocessing"
backend: "python"
max_batch_size: 64
input [
  {
    name: "INPUT_TOKENS"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  ...
]
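For reference, here is a minimal sketch (not the issue author's actual code) of how the Python postprocessing model might read the new INPUT_TOKENS tensor in its execute() method; the "OUTPUT" name and the token-count logic are placeholder assumptions:

```python
# postprocessing/1/model.py (sketch) -- consumes the INPUT_TOKENS tensor
# propagated from the preprocessing step through the ensemble.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Fetch the tensor mapped to INPUT_TOKENS by the ensemble config.
            input_tokens = pb_utils.get_input_tensor_by_name(
                request, "INPUT_TOKENS").as_numpy()

            # Placeholder postprocessing: report the token count.
            # ("OUTPUT" is an assumed output name, not from the original issue.)
            token_count = np.array([[input_tokens.shape[-1]]], dtype=np.int32)
            out = pb_utils.Tensor("OUTPUT", token_count)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```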

Step 3: Try to inherit data from the preprocessing model/step inside ensemble/config.pbtxt:

ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      ...
      output_map {
        key: "INPUT_ID"
        value: "_INPUT_ID"
      }
      ...
    },
    {
      model_name: "tensorrt_llm"
      model_version: -1
      ...
    },
    {
      model_name: "postprocessing"
      model_version: -1
      input_map {
        key: "INPUT_TOKENS" # add a new field which was propagated from "preprocessing"
        value: "_INPUT_ID"
      }
      ...
    }
  ]
}

Step 4: Start the Triton server; we get the following error, which causes the server to shut down.

E0525 08:29:56.598979 93 model_repository_manager.cc:579] Invalid argument: in ensemble ensemble, step of model 'ensemble' receives inputs originated from different decoupled models

Expected behavior: We should be able to propagate data within the ensemble pipeline without needing to disable decoupled mode, since we need to use the streaming feature.
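For context, decoupled models (and ensembles that contain them) are consumed through the gRPC streaming client API, which is the "streaming" mentioned above. Below is a minimal sketch of how such a streaming request might be issued against the ensemble; the model name "ensemble" matches the issue, but the text_input/text_output tensor names are assumptions:

```python
# Minimal sketch of a streaming (decoupled) inference request via gRPC.
import numpy as np
import tritonclient.grpc as grpcclient


def callback(result, error):
    # Invoked once per streamed response from the decoupled pipeline.
    if error is not None:
        print("error:", error)
    else:
        print("partial result:", result.as_numpy("text_output"))


client = grpcclient.InferenceServerClient(url="localhost:8001")

text = np.array([["hello"]], dtype=object)
inp = grpcclient.InferInput("text_input", text.shape, "BYTES")
inp.set_data_from_numpy(text)

client.start_stream(callback=callback)
client.async_stream_infer(model_name="ensemble", inputs=[inp], request_id="1")
client.stop_stream()  # closes the stream after outstanding responses arrive
```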

adrtsang commented 2 months ago

I have the exact same problem trying to propagate the input to the postprocessing model in the ensemble pipeline, and I am getting the same error. Have you found a solution yet?

michaelnny commented 2 months ago

@adrtsang no solution or workaround; I've moved on and am using vLLM now, which seems to have much better community support.

adrtsang commented 2 months ago

I am using an ensemble pipeline that serves a TensorRT model. The input to the pre-processing model is an image volume, and this needs to also be propagated to the post-processing model. Both the pre- and post-processing models have decoupled: true in their model_transaction_policy. Here's the config.pbtxt for the ensemble pipeline:

name: "ensemble_model"
platform: "ensemble"
max_batch_size: 0
input [
  {
    name: "input_image"
    data_type: TYPE_FP16
    dims: [ -1, -1, -1 ]
  }
]
output [
  {
    name: "postprocessed_image"
    data_type: TYPE_FP16
    dims: [ -1, -1, -1 ]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "whs_preprocess_model"
      model_version: -1
      input_map {
        key: "whs_preprocess_model_input_image"
        value: "input_image"
      }
      output_map {
        key: "whs_preprocess_model_patch"
        value: "whs_image_patch"
      }
      output_map {
        key: "whs_preprocess_model_pad_dimension"
        value: "whs_res_arr_pad_dimension"
      }
      output_map {
        key: "whs_preprocess_model_padding"
        value: "whs_slicer_to_padding"
      }
      output_map {
        key: "whs_preprocess_model_slicer"
        value: "whs_slicer"
      }
      output_map {
        key: "whs_preprocess_model_background"
        value: "whs_background_indices"
      }
    },
    {
      model_name: "whs_model"
      model_version: -1
      input_map {
        key: "input"
        value: "whs_image_patch"
      }
      output_map {
        key: "output"
        value: "whs_model_prediction"
      }
    },
    {
      model_name: "whs_postprocess_model"
      model_version: -1
      input_map {
        key: "whs_postprocess_input_prediction"
        value: "whs_model_prediction"
      }
      input_map {
        key: "whs_postprocess_model_pad_dimension"
        value: "whs_pad_dimension"
      }
      input_map {
        key: "whs_postprocess_model_padding"
        value: "whs_slicer_to_padding"
      }
      input_map {
        key: "whs_postprocess_model_slicer"
        value: "whs_slicer"
      }
      input_map {
        key: "whs_postprocess_model_background"
        value: "whs_background_indices"
      }
      input_map {
        key: "whs_postprocess_input_image"
        value: "input_image"
      }
      output_map {
        key: "whs_postprocessed_output"
        value: "postprocessed_image"
      }
    }
  ]
}

When loading this ensemble pipeline in the Triton server (nvcr.io/nvidia/tritonserver:24.07-py3), it gives an error:

E0903 17:36:36.480489 1 model_repository_manager.cc:614] "Invalid argument: in ensemble ensemble_model, step of model 'ensemble_model' receives inputs originated from different decoupled models"

How can I propagate the input image to both pre- and post-processing models?

michaelnny commented 1 month ago

If I remember correctly, it does not work if we enable decoupled mode in the ensemble model pipeline; I'm not sure if this was by design or something else.

You can try to disable it and see if it works, but doing so will also disable streaming.

lakshbhasin commented 1 month ago

I had the same "receives inputs originated from different decoupled models" issue and was able to resolve it. In my case, the issue was that I had used the same key name as an input in different models, which doesn't work correctly with decoupled models.

Here is a simplified description of the config.pbtxt setup that was causing this error for me (a reconstructed sketch of it follows below):

The issue here is that the `last_stage` takes both `INFERENCE_OUTPUT` and `INPUT_JSON` as inputs. This works fine in non-decoupled mode as we can just pass along `INPUT_JSON` from the overall ensemble input.

However, in decoupled mode, it seems like the `last_stage` (not itself decoupled, but following the decoupled `inference` stage) needs all its inputs provided by the previous decoupled `inference` stage. Otherwise, `last_stage` receives its inputs from different sources: the decoupled model that provides `INFERENCE_OUTPUT`, and the ensemble input that provides `INPUT_JSON`.
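To make the failing setup concrete, here is a hypothetical reconstruction of the ensemble_scheduling that triggers the error, based on the description above; the `LAST_STAGE_INFERENCE_OUTPUT` key and the exact mappings are assumptions:

```protobuf
# Reconstructed sketch of the original (failing) ensemble_scheduling.
ensemble_scheduling {
  step [
    # First stage "fetch_signals" omitted for brevity.
    { ... },
    {
      model_name: "inference"  # decoupled
      output_map {
        key: "INFERENCE_OUTPUT"
        value: "INFERENCE_OUTPUT"
      }
    },
    {
      model_name: "last_stage"  # not decoupled
      input_map {
        key: "LAST_STAGE_INFERENCE_OUTPUT"  # assumed key name
        value: "INFERENCE_OUTPUT"           # produced by the decoupled "inference" model
      }
      input_map {
        key: "LAST_STAGE_INPUT_JSON"
        value: "INPUT_JSON"                 # comes straight from the ensemble input
      }
      # Mixing an input produced by a decoupled model with one taken from the
      # ensemble input is what triggers "receives inputs originated from
      # different decoupled models".
    }
  ]
}
```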

This seems like a bug in Triton Inference Server. **I worked around this by essentially having the second decoupled `inference` stage output all the tensors needed by the next non-decoupled stage.** So the modified config.pbtxt makes the following changes:
```protobuf
# Other fields unchanged
ensemble_scheduling {
  step [
    # First stage "fetch_signals" is unchanged
    {...}
    # Decoupled "inference" stage re-outputs request JSON
    { 
      model_name: "inference"
      # All fields unchanged but add:
      output_map {
        # inference stage model.py will output "INFERENCE_INPUT_JSON"
        # which will be the same value as "INPUT_JSON"
        key: "INFERENCE_INPUT_JSON" 
        # Use a different output name than "INPUT_JSON" to fix the bug
        value: "INFERENCE_REQUEST_INPUT_JSON"
      }
    },
    # Not decoupled "last_stage" is changed to modify the input:
    {
      model_name: "last_stage"
      # All other fields unchanged but modify:
      input_map {
        key: "LAST_STAGE_INPUT_JSON"
        value: "INFERENCE_REQUEST_INPUT_JSON"  # Modified name
      }
    }
  ]
}
```

Essentially, the fix is for the decoupled stage to re-output `INPUT_JSON` as `INFERENCE_INPUT_JSON`, which the ensemble maps to `INFERENCE_REQUEST_INPUT_JSON` and the next stage then consumes, renamed via its input_map, as `LAST_STAGE_INPUT_JSON`.

This fix ensures all the inputs to `last_stage` are produced by the previous decoupled `inference` stage and can be consumed at the same time.

In addition to the ensemble config.pbtxt change above, I also had to change the config.pbtxt for the `inference` stage to declare `INFERENCE_INPUT_JSON` as an output, and change its model.py code to fetch the input tensor and re-output it as `INFERENCE_INPUT_JSON`.
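For illustration, here is a minimal sketch of that model.py change, under the assumption that the inference stage is a decoupled Python backend model (if it uses a different backend, the equivalent change would be made there); tensor names follow the config above:

```python
# inference/1/model.py (sketch) -- decoupled model that re-emits its
# INPUT_JSON input as INFERENCE_INPUT_JSON alongside its normal output.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            input_json = pb_utils.get_input_tensor_by_name(request, "INPUT_JSON")

            # Real inference producing INFERENCE_OUTPUT is omitted; this
            # placeholder just echoes the input so the sketch stays runnable.
            inference_output = pb_utils.Tensor(
                "INFERENCE_OUTPUT", input_json.as_numpy())

            # Re-output the request JSON so the downstream non-decoupled stage
            # gets all of its inputs from this decoupled model.
            echoed_json = pb_utils.Tensor(
                "INFERENCE_INPUT_JSON", input_json.as_numpy())

            sender.send(pb_utils.InferenceResponse(
                output_tensors=[inference_output, echoed_json]))
            # Signal that no further responses will be sent for this request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        return None
```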

Hope this helps you demystify the cause of this bug in your case.

adrtsang commented 1 month ago

Hi @lakshbhasin, I was able to implement the workaround above and got further in my ensemble pipeline. Thank you! However, I now run into an issue where the outputs from the pre-processing model are not passed properly to the next stage (inference) in the pipeline. I get the following error when I run my client code:

tritonclient.utils.InferenceServerException: in ensemble 'whs_ensemble_model', [request id: 1] input byte size mismatch for input 'whs_inference_input_background' for model 'whs_inference_model'. Expected 6658560, got 0

I return the outputs in the pre-processing model.py script using inference_response = pb_utils.InferenceResponse(output_tensors=[output1, output2, output3]). Any idea why zero bytes of input data are passed to the next stage in the pipeline, causing this exception?