Another related question I have is: what are the correct shapes to use for ONNX models that take a fixed input shape with a batch size of 1 (e.g. the 1x3x473x473 input shape of the `conv_single_batch.onnx` model discussed above)? Is the model configuration file I pasted above correct for that single-batch model? My configuration file was based on my understanding of the following paragraph from the documentation (https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html):

> "Input and output shapes are specified by a combination of max_batch_size and the dimensions specified by the input or output dims property. For models with max_batch_size greater-than 0, the full shape is formed as [ -1 ] + dims. For models with max_batch_size equal to 0, the full shape is formed as dims. For example, for the following configuration the shape of “input0” is [ -1, 16 ] and the shape of “output0” is [ -1, 4 ]."
Hi @jackylu0124, does the all-zero output that you are observing happen only when the ONNX model is invoked using BLS, or does it also happen when you send requests to this model individually?
The `Failed to open the cudaIpcHandle` error looks like a bug that needs further investigation.
I think your model configuration is correct. Triton can also auto-complete the model configuration for ONNX models so you don't have to provide the configuration files for this type of model.
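For example, auto-complete can be enabled at server launch (flag spelling from memory, so worth double-checking against the release docs; newer releases use --disable-auto-complete-config to turn it off instead):

```
tritonserver --model-repository=/models --strict-model-config=false
```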
Thank you very much for taking a look at it! The all-zero outputs I have observed happen when I invoke model inference calls inside BLS. I logged the contents of the tensors both in the server's BLS code (`model.py`) and in the Python client that receives the response from the BLS model, and both show an all-zeros tensor. I think I also tried sending requests to the ONNX model directly from the Python client (`client.py`), but the program simply hangs and the server was not able to receive the inference requests. Do you by chance have any insights or suggestions on why this might happen based on the behaviors described above?
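For reference, the way I invoke the ONNX model from BLS follows roughly this pattern (a simplified sketch, not the exact model.py from the zip; the tensor and model names are placeholders):

```python
import numpy as np
import triton_python_backend_utils as pb_utils

def call_onnx_model(input_array):
    # Wrap the numpy array in a Triton tensor ("INPUT0" is a placeholder name).
    input_tensor = pb_utils.Tensor("INPUT0", input_array.astype(np.float32))

    # Build and execute a BLS request against the ONNX model.
    infer_request = pb_utils.InferenceRequest(
        model_name="conv_single_batch",
        requested_output_names=["OUTPUT0"],
        inputs=[input_tensor])
    infer_response = infer_request.exec()
    if infer_response.has_error():
        raise pb_utils.TritonModelException(infer_response.error().message())

    # Pull the output tensor; this is where I log the values and see all zeros.
    # (If the tensor lived in GPU memory, a DLPack conversion would be
    # needed instead of as_numpy.)
    output_tensor = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0")
    output_array = output_tensor.as_numpy()
    print("BLS output:", output_array)
    return output_array
```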
And thanks for the confirmation on the configuration files. Do you think the issues might have happened because I explicitly provided the configuration files? Perhaps I should let Triton auto-complete the model configuration rather than providing my own?

Thanks a lot for the help again!
Hi Jacky, thanks for providing detailed repro instructions. I tried both single and dynamic batch but the client outputs non-zero tensors and doesn't return the error.
> And thanks for the confirmation on the configuration files. Do you think the issues might have happened because I explicitly provided the configuration files? Perhaps I should let Triton auto-complete the model configuration rather than providing my own?

I don't think the errors that you are seeing are because of the configuration files, but auto-complete can help with easier deployment of your models.
I'm not able to repro this problem in my environment. The only difference between my environment and @jackylu0124's is that he is using CUDA 11.6. As a next step, he's going to try upgrading CUDA to see whether that resolves the problem.
Closing due to inactivity.
I also encountered a similar bug: https://github.com/triton-inference-server/server/issues/6220
**Description**

I am calling inference requests on multiple ONNX models using the ONNX Runtime CUDA backend (`KIND_GPU`) in the Python backend file. For my ONNX model that takes an input with a fixed batch size of 1, the inference request returns a tensor containing all zeros, which differs from the results of pure ONNX Runtime inference outside of Triton. To make this issue easy to reproduce, I have created two very simple ONNX models that each contain only a single convolution layer: one model (`conv_single_batch.onnx`) takes an input with a fixed size of 1x3x473x473, and the other model (`conv_dynamic_batch.onnx`) takes input with a dynamic batch size (e.g. Nx3x473x473). The convolution layer in both models has non-zero weights and biases, and the reproduction example runs both models on an input tensor of all ones. The behavior I have observed is that the inference request on `conv_single_batch.onnx` always returns a tensor of all zeros, and subsequent inference calls on it lead Triton to emit the error message `"tritonclient.utils.InferenceServerException: Failed to process the request(s) for model instance 'pipeline_0', message: TritonModelException: Failed to open the cudaIpcHandle. error: unknown error"`. However, if I switch the inference backend from `KIND_GPU` to `KIND_CPU`, it sometimes returns results that are not all zeros. On the other hand, for the `conv_dynamic_batch.onnx` model that takes a dynamic batch size, the first inference call can produce correct results, but subsequent inference calls lead to the same error message. I have attached the entire zipped project with code and ONNX models below, and for others' convenience I have also pasted my code as well as screenshots of my ONNX models' structure below.

**Zipped Folder Containing All Files And ONNX Models**

TritonDebug.zip
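As a point of comparison outside of Triton, the pure ONNX Runtime check mentioned above is along these lines (a simplified sketch, not the exact script from the zip):

```python
import numpy as np
import onnxruntime as ort

# Run the single-batch model directly in ONNX Runtime on an all-ones input.
session = ort.InferenceSession(
    "conv_single_batch.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
ones = np.ones((1, 3, 473, 473), dtype=np.float32)
outputs = session.run(None, {input_name: ones})

# With non-zero weights and biases, the convolution output
# should not be all zeros.
print("all zeros?", not np.any(outputs[0]))
```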
**Triton Information**

I am using the `nvcr.io/nvidia/tritonserver:23.01-py3` Docker container.

**To Reproduce**

I have included a simple client file (`client.py`) in the zipped folder that makes inference requests to the Triton Inference Server. You can reproduce the issues mentioned above by running the client file after launching the Triton Inference Server.

**Expected behavior**

Inference requests should not return tensors containing all zeros, and additional inference request calls should not cause the Triton Inference Server to emit the error message `"tritonclient.utils.InferenceServerException: Failed to process the request(s) for model instance 'pipeline_0', message: TritonModelException: Failed to open the cudaIpcHandle. error: unknown error"`.
**Python Backend File (model.py)**

**Python Backend Config File**

**Single Batch Model Config File (for conv_single_batch.onnx)**

**Dynamic Batch Size Model Config File (for conv_dynamic_batch.onnx)**

**Client file for making inference requests (client.py)**

**Screenshot of the conv_single_batch.onnx model in Netron**

**Screenshot of the conv_dynamic_batch.onnx model in Netron**