Hi @cmpark0126, on why output_tensors=[] is acceptable for dims: [ -1, -1 ] output dimensions: if a dimension is -1, then Triton will accept any value greater-or-equal-to 0 for that dimension. See the following statement from the model configuration documentation:
For example, if a model requires a 2-dimensional input tensor where the first dimension must be size 4 but the second dimension can be any size, the model configuration for that input would include dims: [ 4, -1 ]. Triton would then accept inference requests where that input tensor's second dimension was any value greater-or-equal-to 0.
In your use case, [-1, -1] dimensions include [0, 0] dimensions, and a [0, 0] tensor contains nothing. I think this is why your setup worked.
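To illustrate (not from the original thread), a minimal sketch reusing the OUTPUT0 name from your config: an explicitly zero-sized output is also accepted under dims: [ -1, -1 ], alongside omitting the output entirely.

import numpy as np
import triton_python_backend_utils as pb_utils

# A (0, 0)-shaped tensor is a valid value for dims: [ -1, -1 ],
# since each -1 dimension accepts any size greater-or-equal-to 0.
empty_out = pb_utils.Tensor("OUTPUT0", np.empty((0, 0), dtype=np.float32))
response = pb_utils.InferenceResponse(output_tensors=[empty_out])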
cc @tanmayv25 @nnshah1: this is an interesting use case of outputting results through an input tensor on shared memory.
@kthui First of all, thank you for the kind comment here!
As you said, [0, 0] dimensions are acceptable in my setup, equivalent to an empty output. However, it still works when I use fixed output dimensions like the config below:
name: "matmul"
backend: "python"
....
input [
{
name: "OUTPUT0"
data_type: TYPE_FP32
# FYI, if I use the wrong dimensions (e.g., [10000, 10000]) and send a [1000, 1000]-shaped tensor,
# Triton raises an exception.
dims: [ 1000, 1000 ]
optional: true
}
]
output [
{
name: "OUTPUT0"
data_type: TYPE_FP32
# I try to use fixed dimensions
dims: [ 1000, 1000 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [0]
}
]
parameters: { key: "FORCE_CPU_ONLY_INPUT_TENSORS" value: {string_value:"no"}}
inference_response = pb_utils.InferenceResponse(
    output_tensors=[]  # not to return output
)

Thanks for the update. If the model may or may not output any responses, would you be able to use Decoupled mode? Under decoupled mode, the model may return zero or more responses per request as needed. I think this should provide enough flexibility for your use case without worrying about output validation.
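Not part of the original reply, but a minimal sketch of what a decoupled Python backend model could look like, assuming the config also sets model_transaction_policy { decoupled: True }; the model decides per request whether to send any responses at all:

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            # ... run the computation; send zero, one, or many responses as needed,
            # e.g. sender.send(pb_utils.InferenceResponse(output_tensors=[...])) ...
            # Close the response stream for this request (no further responses).
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models do not return responses from execute().
        return None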
@Tabrizian @krishung5 do you know if it is ok or not for Python backend models to send empty responses?
for request in requests:
    inference_response = pb_utils.InferenceResponse(
        output_tensors=[]  # not to return output <----- empty output tensors here
    )
    responses.append(inference_response)
with a dims: [ 1000, 1000 ] output shape in the model config?
The outputs returned by the model do not have to conform to the model configuration. This is for performance reasons (i.e., not to slow down inference with these checks on every request). If you want to make sure that the outputs returned by the model conform to the model configuration, you can modify your model.py to do that.
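For anyone who does want those checks, a rough sketch (the helper names are illustrative, not an official pattern beyond pb_utils.get_output_config_by_name) of validating an output shape against the config inside model.py:

import json
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        model_config = json.loads(args["model_config"])
        # Configured dims for OUTPUT0, e.g. [1000, 1000] or [-1, -1].
        out_cfg = pb_utils.get_output_config_by_name(model_config, "OUTPUT0")
        self.expected_dims = out_cfg["dims"]

    def check_output_shape(self, array):
        # Call this from execute() before building the response.
        # A -1 in the config matches any size; other dims must match exactly.
        if len(array.shape) != len(self.expected_dims) or any(
            cfg != -1 and cfg != got
            for cfg, got in zip(self.expected_dims, array.shape)
        ):
            raise pb_utils.TritonModelException(
                "output shape {} does not match configured dims {}".format(
                    array.shape, self.expected_dims))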
Closing issue due to inactivity. Please reopen if you would like to follow up with this issue.
Background
I use a shared memory output tensor as an input tensor so that I can use CUDA async memcpy in the Python backend, as below.
FYI, I use the config.pbtxt written below. The reason I keep the output config is that I want to cover every inference case simultaneously (e.g., passing a numpy tensor directly without shared memory, or passing shared memory).
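Not from the original issue, but roughly what that pattern might look like in model.py, assuming CuPy is available in the backend environment and the client passes its pre-registered CUDA shared-memory output buffer through the optional OUTPUT0 input:

import cupy as cp  # assumption: CuPy is installed in the backend environment
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # The caller's output buffer arrives as the optional input "OUTPUT0".
            out_buf = pb_utils.get_input_tensor_by_name(request, "OUTPUT0")
            if out_buf is not None and not out_buf.is_cpu():
                # Wrap the GPU buffer without copying via DLPack.
                # (Using CuPy for this is an assumption, not from the issue.)
                dst = cp.from_dlpack(out_buf.to_dlpack())
                # ... launch the kernel / async memcpy that writes the result into dst ...
                # The result already lives in the caller's buffer, so the response
                # carries no output tensors.
                responses.append(pb_utils.InferenceResponse(output_tensors=[]))
            else:
                # Plain request without a shared-memory buffer: this sketch only
                # covers the shared-memory path, so report an error instead.
                responses.append(pb_utils.InferenceResponse(
                    output_tensors=[],
                    error=pb_utils.TritonError("OUTPUT0 buffer not provided")))
        return responses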
Question