Hi @cmpark0126, on why output_tensors=[] is acceptable for dims: [ -1, -1 ] output dimensions: if a dimension is -1, then Triton will accept any value greater-or-equal-to 0 for that dimension. See the following statement from the model configuration documentation:
For example, if a model requires a 2-dimensional input tensor where the first dimension must be size 4 but the second dimension can be any size, the model configuration for that input would include dims: [ 4, -1 ]. Triton would then accept inference requests where that input tensor's second dimension was any value greater-or-equal-to 0.
In your use case, [-1, -1] dimensions include [0, 0] dimensions, and a [0, 0] tensor contains nothing. I think this is why your setup worked.
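To illustrate (not from the original thread), a minimal sketch reusing the OUTPUT0 name from your config: an explicitly zero-sized output is also accepted under dims: [ -1, -1 ], alongside omitting the output entirely.

import numpy as np
import triton_python_backend_utils as pb_utils

# A (0, 0)-shaped tensor is a valid value for dims: [ -1, -1 ],
# since each -1 dimension accepts any size greater-or-equal-to 0.
empty_out = pb_utils.Tensor("OUTPUT0", np.empty((0, 0), dtype=np.float32))
response = pb_utils.InferenceResponse(output_tensors=[empty_out])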
cc @tanmayv25 @nnshah1: this is an interesting use case of outputting results through an input tensor on shared memory.
@kthui First of all, thank you for the kind comment here!
As you said, [0, 0] dimensions are acceptable in my setup, equivalent to an empty output. However, it still works when I use fixed output dimensions like the config below:
name: "matmul"
backend: "python"
....
input [
{
name: "OUTPUT0"
data_type: TYPE_FP32
# FYI, if I use the wrong dimensions (e.g., [10000, 10000]) and send a [1000, 1000]-shaped tensor,
# Triton raises an exception.
dims: [ 1000, 1000 ]
optional: true
}
]
output [
{
name: "OUTPUT0"
data_type: TYPE_FP32
# I try to use fixed dimensions
dims: [ 1000, 1000 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [0]
}
]
parameters: { key: "FORCE_CPU_ONLY_INPUT_TENSORS" value: {string_value:"no"}}
inference_response = pb_utils.InferenceResponse(
    output_tensors=[]  # not to return output
)

Thanks for the update. If the model may or may not output any responses, would you be able to use Decoupled mode? Under decoupled mode, the model may return zero or more responses per request as needed. I think this should provide enough flexibility for your use case without worrying about output validation.
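Not part of the original reply, but a minimal sketch of what a decoupled Python backend model could look like, assuming the config also sets model_transaction_policy { decoupled: True }; the model decides per request whether to send any responses at all:

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            # ... run the computation; send zero, one, or many responses as needed,
            # e.g. sender.send(pb_utils.InferenceResponse(output_tensors=[...])) ...
            # Close the response stream for this request (no further responses).
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models do not return responses from execute().
        return None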
@Tabrizian @krishung5 do you know if it is ok or not for Python backend models to send empty responses?
for request in requests:
    inference_response = pb_utils.InferenceResponse(
        output_tensors=[]  # not to return output <----- empty output tensors here
    )
    responses.append(inference_response)
with a dims: [ 1000, 1000 ] output shape in the model config?
The outputs returned by the model do not have to conform to the model configuration. This is for performance reasons (i.e., not to slow down inference with these checks on every request). If you want to make sure that the outputs returned by the model conform to the model configuration, you can modify your model.py to do that.
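For anyone who does want those checks, a rough sketch (the helper names are illustrative, not an official pattern beyond pb_utils.get_output_config_by_name) of validating an output shape against the config inside model.py:

import json
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        model_config = json.loads(args["model_config"])
        # Configured dims for OUTPUT0, e.g. [1000, 1000] or [-1, -1].
        out_cfg = pb_utils.get_output_config_by_name(model_config, "OUTPUT0")
        self.expected_dims = out_cfg["dims"]

    def check_output_shape(self, array):
        # Call this from execute() before building the response.
        # A -1 in the config matches any size; other dims must match exactly.
        if len(array.shape) != len(self.expected_dims) or any(
            cfg != -1 and cfg != got
            for cfg, got in zip(self.expected_dims, array.shape)
        ):
            raise pb_utils.TritonModelException(
                "output shape {} does not match configured dims {}".format(
                    array.shape, self.expected_dims))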
Closing issue due to inactivity. Please reopen if you would like to follow up with this issue.
Background
I use a shared memory output tensor as an input tensor so that I can use CUDA async memcpy in the Python backend, as below.
FYI, I use the config.pbtxt written below. The reason I keep the output config is that I want to cover every inference case simultaneously (e.g., passing a numpy tensor directly without shared memory, or passing shared memory).
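Not from the original issue, but roughly what that pattern might look like in model.py, assuming CuPy is available in the backend environment and the client passes its pre-registered CUDA shared-memory output buffer through the optional OUTPUT0 input:

import cupy as cp  # assumption: CuPy is installed in the backend environment
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # The caller's output buffer arrives as the optional input "OUTPUT0".
            out_buf = pb_utils.get_input_tensor_by_name(request, "OUTPUT0")
            if out_buf is not None and not out_buf.is_cpu():
                # Wrap the GPU buffer without copying via DLPack.
                # (Using CuPy for this is an assumption, not from the issue.)
                dst = cp.from_dlpack(out_buf.to_dlpack())
                # ... launch the kernel / async memcpy that writes the result into dst ...
                # The result already lives in the caller's buffer, so the response
                # carries no output tensors.
                responses.append(pb_utils.InferenceResponse(output_tensors=[]))
            else:
                # Plain request without a shared-memory buffer: this sketch only
                # covers the shared-memory path, so report an error instead.
                responses.append(pb_utils.InferenceResponse(
                    output_tensors=[],
                    error=pb_utils.TritonError("OUTPUT0 buffer not provided")))
        return responses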
Question