triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

[Question] Do you plan to add output validation in the future? #6641

Closed: cmpark0126 closed this issue 8 months ago

cmpark0126 commented 11 months ago

Background

Question

kthui commented 11 months ago

Hi @cmpark0126, regarding why output_tensors=[] is acceptable for dims: [ -1, -1 ] output dimensions: if a dimension is -1, Triton will accept any value greater than or equal to 0 for that dimension. See the following statement from the model configuration documentation:

For example, if a model requires a 2-dimensional input tensor where the first dimension must be size 4 but the second dimension can be any size, the model configuration for that input would include dims: [ 4, -1 ]. Triton would then accept inference requests where that input tensor's second dimension was any value greater-or-equal-to 0.

In your use case, [-1, -1] dimensions include [0, 0] dimensions, and a [0, 0] tensor contains nothing. I think this is why your setup worked.
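As an illustration (not taken from the issue), a [0, 0] output tensor built in the Python backend would also satisfy dims: [ -1, -1 ]; the tensor name OUTPUT0 is assumed from the config shared later in this thread:

import numpy as np
import triton_python_backend_utils as pb_utils

# Each -1 dimension accepts any size >= 0, including 0, so a [0, 0]
# (i.e., empty) OUTPUT0 tensor is a valid output for dims: [ -1, -1 ].
empty_output = pb_utils.Tensor("OUTPUT0", np.empty((0, 0), dtype=np.float32))
response = pb_utils.InferenceResponse(output_tensors=[empty_output])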

cc @tanmayv25 @nnshah1: this is an interesting use case of outputting results through an input tensor on shared memory.

cmpark0126 commented 11 months ago

@kthui First of all, thank you for the kind comment here!

As you said, [0, 0] dimensions are acceptable in my setup, which is equivalent to an empty output. However, it still works even when I use fixed output dimensions like the ones below:

name: "matmul"
backend: "python"

....
input [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    # FYI, if I use the wrong dimensions (e.g., [10000, 10000]) and send a
    # [1000, 1000] shaped tensor, Triton raises an exception.
    dims: [ 1000, 1000 ]
    optional: true
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    # Here I use fixed dimensions instead of [ -1, -1 ]
    dims: [ 1000, 1000 ]
  }
]

instance_group [
{
    count: 1
    kind: KIND_GPU
    gpus: [0]
}
]
parameters: { key: "FORCE_CPU_ONLY_INPUT_TENSORS" value: {string_value:"no"}}
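For context, a hypothetical client-side sketch of how the optional OUTPUT0 input could be supplied through system shared memory (region names, URL, and sizes are illustrative, not taken from the issue):

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

byte_size = 1000 * 1000 * 4  # FP32 tensor of shape [1000, 1000]

# Create a system shared memory region and place the preallocated buffer in it.
shm_handle = shm.create_shared_memory_region("output0_data", "/output0_shm", byte_size)
shm.set_shared_memory_region(shm_handle, [np.zeros((1000, 1000), dtype=np.float32)])

client = httpclient.InferenceServerClient(url="localhost:8000")
client.register_system_shared_memory("output0_data", "/output0_shm", byte_size)

# Pass the buffer as the optional OUTPUT0 input so the backend can write into it.
inp = httpclient.InferInput("OUTPUT0", [1000, 1000], "FP32")
inp.set_shared_memory("output0_data", byte_size)
result = client.infer(model_name="matmul", inputs=[inp])
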
kthui commented 11 months ago

Thanks for the update. If the model may or may not output any responses, would you be able to use Decoupled mode? Under decoupled mode, the model may return zero or more responses per request as needed. I think this should provide enough flexibility for your use case without worrying about output validation.
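A minimal sketch (not from the issue) of what that could look like in a decoupled Python backend model; it assumes the config also sets model_transaction_policy { decoupled: True }:

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            # ... compute, and decide whether this request produces any output ...
            # Close the response stream without sending any response at all.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None from execute(); responses (if any) go
        # through the response sender instead.
        return None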

@Tabrizian @krishung5 do you know if it is ok or not for Python backend models to send empty responses?

import triton_python_backend_utils as pb_utils

responses = []
for request in requests:
    # Return a response that carries no output tensors at all.
    inference_response = pb_utils.InferenceResponse(
        output_tensors=[]  # <----- empty output tensors here
    )
    responses.append(inference_response)

with a dims: [ 1000, 1000 ] output shape in the model config.

Tabrizian commented 10 months ago

The outputs returned by the model do not have to conform to the model configuration; Triton skips this check for performance reasons (i.e., so that it does not slow down every inference). If you want to make sure that the outputs returned by the model conform to the model configuration, you can modify your model.py to do that.
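For example, a sketch of such a manual check inside model.py (the helper name and the expected shape are assumptions mirroring the config above, not part of the issue):

import triton_python_backend_utils as pb_utils

EXPECTED_OUTPUT0_SHAPE = (1000, 1000)  # mirrors dims: [ 1000, 1000 ] in config.pbtxt

def validate_output0(output_tensor):
    # Triton skips this check for performance, so enforce it in the model code.
    shape = tuple(output_tensor.as_numpy().shape)
    if shape != EXPECTED_OUTPUT0_SHAPE:
        raise pb_utils.TritonModelException(
            f"OUTPUT0 shape {shape} does not match configured dims {EXPECTED_OUTPUT0_SHAPE}"
        )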

dyastremsky commented 8 months ago

Closing issue due to inactivity. Please reopen if you would like to follow up on this issue.