triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

[Question] Is it possible to have inputs of ensemble models with dimension=1 defined in the config #6696

Open MatthieuToulemont opened 9 months ago

MatthieuToulemont commented 9 months ago

Hello,

I am using a lot of ensemble models in production, and the biggest pain point I have is that in TensorRT it is impossible to index tensors when the index is itself an input.

To work around this, I have to create a one_hot tensor, multiply, and sum to fetch the value I want.

This is very slow. For instance, on a 20-step diffusion loop with DPM++, this costs about 10 ms per request over the full diffusion pipeline compared to having the parameters available directly, without the one_hot, multiply, sum. This is huge and not shippable for us.
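
For illustration, the workaround described above is essentially a table lookup emulated with a one-hot mask. A minimal sketch in PyTorch (table and step are placeholder names, not from the actual model):

import torch

def lookup_via_one_hot(table: torch.Tensor, step: torch.Tensor) -> torch.Tensor:
    # table: [N, D] per-step constants (e.g. scheduler coefficients)
    # step:  scalar integer tensor supplied as a model input
    # Since indexing with a runtime input is reported above as unavailable in
    # this TensorRT setup, table[step] is rewritten as one_hot -> multiply ->
    # sum, which is the slow path being described.
    mask = torch.nn.functional.one_hot(step.long(), num_classes=table.shape[0])
    return (mask.unsqueeze(-1).to(table.dtype) * table).sum(dim=0)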

The simplest solution I have found so far is to have a model generate those parameters, cache the response, and forward it to the diffusion loop ensemble model, but this is very convoluted and not very elegant.

Taking the DPM++ diffusion loop as an example, it would be good to be able to define the parameters needed to run the diffusion and scheduler steps directly in the ensemble config.

What do you think of adding this as a feature?

oandreeva-nv commented 9 months ago

Hi @MatthieuToulemont, have you tried specifying parameters for the TensorRT model in its config.pbtxt?

For example: https://github.com/triton-inference-server/server/blob/d6bd668cf2208ef70d951182f0fda7d5a7e21c82/docs/examples/model_repository/simple_dyna_sequence/config.pbtxt#L90-L95
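
For context, parameters declared this way reach the backend through the model configuration, not as input tensors. A minimal sketch of how a Python-backend model (a hypothetical wrapper, not the TensorRT backend itself) would read such a parameter in initialize:

import json

class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is the model's config.pbtxt serialized as JSON.
        model_config = json.loads(args["model_config"])
        # Each parameter arrives as {"<key>": {"string_value": "<value>"}}.
        parameters = model_config.get("parameters", {})
        self.execute_delay_ms = int(parameters["execute_delay_ms"]["string_value"])

    # execute(...) omitted for brevity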

oandreeva-nv commented 9 months ago

I am not familiar with the intricacies of your model though. If you could possibly provide an illustrative example of what you mean, it would be easier for us to find a solution or implement this feature.

MatthieuToulemont commented 9 months ago

Hello, basically I am wondering whether, taking the config you shared as an example, it is possible to use the parameter execute_delay_ms as an input to a TensorRT model through the input maps of an ensemble model.

MatthieuToulemont commented 9 months ago

For instance:

input {
  name: "input_0"
}
output {
  name: "output_0"
}
ensemble_scheduling {
  step {
    model_name: "MODEL_1"
    model_version: -1
    input_map {
      key: "input_0"
      value: "input_0"
    }
    input_map {
      key: "input_1"
      value: "input_1"
    }
    output_map {
      key: "output_0"
      value: "output_0"
    }
  }
}

parameters {
  key: "input_1"
  value {
    string_value: "5"
  }
}

In the example above, will the value of input_1 defined in the parameters be forwarded to MODEL_1 through the input_map value?

oandreeva-nv commented 9 months ago

Would you possibly consider a BLS approach instead of an ensemble?

This is definitely possible in BLS.
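
For reference, a minimal BLS sketch (model and tensor names taken from the ensemble example above; the constant value is hypothetical) that injects "input_1" alongside the request's "input_0" before calling MODEL_1 with a blocking request:

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            input_0 = pb_utils.get_input_tensor_by_name(request, "input_0")
            # The constant that the ensemble config cannot express today.
            input_1 = pb_utils.Tensor("input_1", np.array([5], dtype=np.int32))
            infer_request = pb_utils.InferenceRequest(
                model_name="MODEL_1",
                requested_output_names=["output_0"],
                inputs=[input_0, input_1],
            )
            infer_response = infer_request.exec()  # blocking call
            if infer_response.has_error():
                raise pb_utils.TritonModelException(infer_response.error().message())
            output_0 = pb_utils.get_output_tensor_by_name(infer_response, "output_0")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_0]))
        return responses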

MatthieuToulemont commented 9 months ago

The issue with BLS models (per my understanding) is that we don't benefit from the ensemble scheduling.

As soon as I put a BLS model in front to manage my ensemble, I will need to wait for one request to be fully processed before another can start being processed, right?

oandreeva-nv commented 9 months ago

This is true when you use the inference_request.exec function, which executes blocking inference requests. You can also explore inference_request.async_exec, which lets you perform asynchronous inference requests. This can be useful when you do not need the result of the inference immediately. Using the async_exec function, it is possible to have multiple in-flight inference requests and wait for the responses only when needed. Our docs show examples for both the blocking and non-blocking cases.
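
As a rough sketch of the non-blocking variant under the same assumptions (names reused from the earlier example), execute can be a coroutine so that several BLS requests are in flight at once:

import asyncio
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    async def execute(self, requests):
        infer_requests = [
            pb_utils.InferenceRequest(
                model_name="MODEL_1",
                requested_output_names=["output_0"],
                inputs=[pb_utils.get_input_tensor_by_name(r, "input_0")],
            )
            for r in requests
        ]
        # Launch all BLS requests, then collect the responses only when needed.
        infer_responses = await asyncio.gather(
            *[req.async_exec() for req in infer_requests]
        )
        responses = []
        for infer_response in infer_responses:
            if infer_response.has_error():
                raise pb_utils.TritonModelException(infer_response.error().message())
            output_0 = pb_utils.get_output_tensor_by_name(infer_response, "output_0")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_0]))
        return responses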

SunXuan90 commented 9 months ago

I wish to have this feature too: the ability to directly declare constants in config.pbtxt. I've been using the DALI backend, and a few projects share the same pipeline, differing only in a few parameters that remain unchanged once the service has started. Right now I have to rebuild the parameters into the pipeline for every project, but with this feature the pipeline could be configured through the pbtxt.

dyastremsky commented 7 months ago

We have opened a ticket to look into this enhancement.

ref: 6179

Related issue: https://github.com/triton-inference-server/server/issues/6561