triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Inference result of single batch ONNX model contains all zeros and also emits "Failed to open the cudaIpcHandle." error in additional inference calls #5471

Closed: jackylu0124 closed this issue 1 year ago

jackylu0124 commented 1 year ago

Description: I am calling inference requests on multiple ONNX models using the ONNX Runtime CUDA backend (KIND_GPU) from a Python backend (BLS) model. For my ONNX model that takes an input with a fixed batch size of 1, the inference request returns a tensor containing all zeros, which is different from the result of running the same model with pure ONNX Runtime outside of Triton.

To reproduce the issue, I created two very simple ONNX models that each contain only a single convolution layer: conv_single_batch.onnx takes an input with a fixed shape of 1x3x473x473, while conv_dynamic_batch.onnx accepts a dynamic batch size (Nx3x473x473). The convolution layer in both models has non-zero weights and biases, and the reproduction runs both models on an input tensor of all ones.

The behavior I have observed is that the inference request on conv_single_batch.onnx always returns a tensor of all zeros, and subsequent inference calls on it lead Triton to emit the error message "tritonclient.utils.InferenceServerException: Failed to process the request(s) for model instance 'pipeline_0', message: TritonModelException: Failed to open the cudaIpcHandle. error: unknown error". However, if I switch the instance kind from KIND_GPU to KIND_CPU, it sometimes returns results that are not all zeros. For conv_dynamic_batch.onnx, the first inference call can produce correct results, but subsequent inference calls fail with the same "Failed to open the cudaIpcHandle" error message. I have attached the entire zipped project with the code and ONNX models below, and for convenience I have also pasted my code as well as screenshots of the ONNX models' structure.

Zipped folder containing all files and ONNX models: TritonDebug.zip
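(The ONNX files themselves are in the zip above. For readers who only want the model structure, a hypothetical export script along the following lines would produce equivalent single-convolution models; the channel counts and input size match the configs below, but the exact weight and bias values in the zipped models are not known here, so the constants are placeholders.)

# Hypothetical sketch of how the two repro models could be exported with
# PyTorch; the actual ONNX files ship in TritonDebug.zip, so the constant
# weight/bias values below are assumptions, not the originals.
import torch

class Conv(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Single 3x3 convolution, 3 -> 20 channels, with non-zero parameters.
        self.conv = torch.nn.Conv2d(3, 20, kernel_size=3)
        torch.nn.init.constant_(self.conv.weight, 0.1)
        torch.nn.init.constant_(self.conv.bias, 0.5)

    def forward(self, x):
        return self.conv(x)

model = Conv().eval()
dummy = torch.ones(1, 3, 473, 473)

# Fixed batch size of 1.
torch.onnx.export(model, dummy, "conv_single_batch.onnx",
                  input_names=["img"], output_names=["out"])

# Dynamic batch size on the first dimension.
torch.onnx.export(model, dummy, "conv_dynamic_batch.onnx",
                  input_names=["img"], output_names=["out"],
                  dynamic_axes={"img": {0: "batch"}, "out": {0: "batch"}})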

Triton Information: I am using the nvcr.io/nvidia/tritonserver:23.01-py3 Docker container.

To Reproduce: I have included a simple client file (client.py) in the zipped folder that makes inference requests to the Triton Inference Server. You can reproduce the issues described above by running the client file after launching the server.

Expected behavior: Inference requests should not return tensors containing all zeros, and additional inference calls should not cause the Triton Inference Server to emit the error message "tritonclient.utils.InferenceServerException: Failed to process the request(s) for model instance 'pipeline_0', message: TritonModelException: Failed to open the cudaIpcHandle. error: unknown error".

Python Backend File (model.py)

import triton_python_backend_utils as pb_utils
import os
import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.dlpack import to_dlpack, from_dlpack

def get_response_tensor_by_name(response, name):
    # Extract the named output tensor from a BLS response as a NumPy array,
    # copying it back from GPU memory via DLPack when it is not on the CPU.
    if response.has_error():
        raise pb_utils.TritonModelException(response.error().message())
    else:
        pb_tensor = pb_utils.get_output_tensor_by_name(response, name)
        if pb_tensor.is_cpu():
            return pb_tensor.as_numpy()
        else:
            return from_dlpack(pb_tensor.to_dlpack()).cpu().numpy()

def run_single_batch_model(img):
    input = pb_utils.Tensor("img", img)
    request = pb_utils.InferenceRequest(model_name="single_batch", inputs=[input], requested_output_names=["out"])
    response = request.exec()
    out = get_response_tensor_by_name(response, "out")
    return out

def run_dynamic_batch_model(img):
    input = pb_utils.Tensor("img", img)
    request = pb_utils.InferenceRequest(model_name="dynamic_batch", inputs=[input], requested_output_names=["out"])
    response = request.exec()
    out = get_response_tensor_by_name(response, "out")
    return out

class TritonPythonModel:
    def initialize(self, args):
        self.logger = pb_utils.Logger
        self.logger.log_info("Initialization completed.")

    def execute(self, requests):
        responses = []
        for request in requests:
            img = pb_utils.get_input_tensor_by_name(request, "img").as_numpy()

            single_batch_output = run_single_batch_model(img.copy())
            self.logger.log_info("single_batch_output: " + str(single_batch_output))

            inference_response = pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor(
                    "out",
                    single_batch_output
                )
            ])

            # dynamic_batch_output = run_dynamic_batch_model(img.copy())
            # self.logger.log_info("dynamic_batch_output: " + str(dynamic_batch_output))

            # inference_response = pb_utils.InferenceResponse(output_tensors=[
            #     pb_utils.Tensor(
            #         "out",
            #         dynamic_batch_output
            #     )
            # ])
            responses.append(inference_response)

        return responses

    def finalize(self):
        self.logger.log_info("Finalization/clean-up completed.")

Python Backend Config File

name: "pipeline"
backend: "python"
max_batch_size: 4

input [
    {
        name: "img"
        data_type: TYPE_FP32
        dims: [3, 473, 473]
    }
]

output [
    {
        name: "out"
        data_type: TYPE_FP32
        dims: [20, 471, 471]
    }
]

instance_group [
    {
        kind: KIND_GPU
    }
]

Single Batch Model Config File (for conv_single_batch.onnx)

name: "single_batch"
platform: "onnxruntime_onnx"
default_model_filename: "conv_single_batch.onnx"
max_batch_size: 0

input [
    {
        name: "img"
        data_type: TYPE_FP32
        dims: [1, 3, 473, 473]
    }
]

output [
    {
        name: "out"
        data_type: TYPE_FP32
        dims: [1, 20, 471, 471]
    }
]

instance_group [
    {
        kind: KIND_GPU
    }
]

model_warmup [
    {
        name: "Random Sample"
        inputs {
            key: "img"
            value: {
                data_type: TYPE_FP32
                dims: [1, 3, 473, 473]
                random_data: true
            }
        }
    }
]

Dynamic Batch Size Model Config File (for conv_dynamic_batch.onnx)

name: "dynamic_batch"
platform: "onnxruntime_onnx"
default_model_filename: "conv_dynamic_batch.onnx"
max_batch_size: 4

input [
    {
        name: "img"
        data_type: TYPE_FP32
        dims: [3, 473, 473]
    }
]

output [
    {
        name: "out"
        data_type: TYPE_FP32
        dims: [20, 471, 471]
    }
]

instance_group [
    {
        kind: KIND_GPU
    }
]

model_warmup [
    {
        name: "Random Sample"
        batch_size: 4
        inputs {
            key: "img"
            value: {
                data_type: TYPE_FP32
                dims: [3, 473, 473]
                random_data: true
            }
        }
    }
]

Client file for making inference requests (client.py)

import numpy as np
from tritonclient.utils import *
import tritonclient.http as httpclient

def main():
    client = httpclient.InferenceServerClient(url="localhost:8000")
    img = np.ones((1, 3, 473, 473), dtype=np.float32)
    img_input = httpclient.InferInput("img", img.shape,
                                       np_to_triton_dtype(img.dtype))
    img_input.set_data_from_numpy(img)

    out = httpclient.InferRequestedOutput("out")

    response = client.infer(model_name="pipeline",
                            inputs=[img_input],
                            outputs=[out])

    image = response.as_numpy("out")
    print(image)

if __name__ == "__main__":
    main()

Screenshot of the conv_single_batch.onnx model in Netron (image omitted)

Screenshot of the conv_dynamic_batch.onnx model in Netron (image omitted)
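For reference, a minimal sketch of the "pure ONNX Runtime inference outside of Triton" comparison mentioned in the description. It assumes the onnxruntime-gpu package is installed and that conv_single_batch.onnx is in the working directory; the input/output names "img" and "out" are taken from the configs above.

import numpy as np
import onnxruntime as ort

# Run conv_single_batch.onnx with plain ONNX Runtime (no Triton) on an
# all-ones input and check that the output is not all zeros.
sess = ort.InferenceSession(
    "conv_single_batch.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
img = np.ones((1, 3, 473, 473), dtype=np.float32)
(out,) = sess.run(["out"], {"img": img})
print(out.shape)               # expected: (1, 20, 471, 471)
print(np.count_nonzero(out))   # should be > 0 given non-zero conv weights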

jackylu0124 commented 1 year ago

Another related question I have: what are the correct shapes to use for ONNX models that take a fixed input shape with a batch size of 1 (e.g. the 1x3x473x473 input of the conv_single_batch.onnx model discussed above)? Is the model configuration file I pasted above for that single-batch model correct? I wrote my configuration based on my understanding of the following paragraph from the documentation (https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html).

_"Input and output shapes are specified by a combination of max_batch_size and the dimensions specified by the input or output dims property. For models with max_batch_size greater-than 0, the full shape is formed as [ -1 ] + dims. For models with max_batchsize equal to 0, the full shape is formed as dims. For example, for the following configuration the shape of “input0” is [ -1, 16 ] and the shape of “output0” is [ -1, 4 ]."

Tabrizian commented 1 year ago

Hi @jackylu0124, does the all-zero output that you are observing happen only when the ONNX model is invoked via BLS, or does it also happen when you send requests to this model directly?

The "Failed to open the cudaIpcHandle." error looks like a bug that needs further investigation.

I think your model configuration is correct. Triton can also auto-complete the model configuration for ONNX models so you don't have to provide the configuration files for this type of model.

jackylu0124 commented 1 year ago

Thank you very much for taking a look at it! The all-zero outputs I have observed occur when I invoke the model inference calls inside BLS. I logged the contents of the tensors both in the server's BLS code (model.py) and in the Python client that receives the response from the BLS model, and both show an all-zeros tensor. I think I also tried sending requests to the ONNX model directly from the Python client (client.py), but the program simply hangs and the server never receives the inference request. Do you by chance have any insight into why this might happen, given the behaviors described above?
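(A direct request of that kind would look roughly like the sketch below: client.py from above with model_name switched to "single_batch". Since that config uses max_batch_size: 0, the full (1, 3, 473, 473) shape is sent explicitly.)

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

# Hypothetical variant of client.py that targets the ONNX model directly
# instead of going through the BLS "pipeline" model.
client = httpclient.InferenceServerClient(url="localhost:8000")
img = np.ones((1, 3, 473, 473), dtype=np.float32)
img_input = httpclient.InferInput("img", img.shape, np_to_triton_dtype(img.dtype))
img_input.set_data_from_numpy(img)

response = client.infer(model_name="single_batch",
                        inputs=[img_input],
                        outputs=[httpclient.InferRequestedOutput("out")])
print(response.as_numpy("out"))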

And thanks for the confirmation on the configuration files. Do you think the issues might have happened because I explicitly provided the configuration files? Perhaps I should let Triton auto-complete the model configuration rather than providing my own?

Thanks a lot for the help again!

Tabrizian commented 1 year ago

Hi Jacky, thanks for providing detailed repro instructions. I tried both the single- and dynamic-batch models, but the client outputs non-zero tensors and doesn't return the error.

And thanks for the confirmation on the configuration files. Do you think the issues might have happened because I explicitly provided the configuration files? Perhaps I should let Triton auto-complete the model configuration rather than providing my own?

I don't think the errors that you are seeing are because of the configuration files but auto-complete can help with easier deployment of your models.

Tabrizian commented 1 year ago

I'm not able to repro this problem in my environment. The only difference between my environment and @jackylu0124's env is that he is using CUDA 11.6. As a next step, he's going to try upgrading his CUDA version to see whether that resolves the problem.

Tabrizian commented 1 year ago

Closing due to inactivity.

Grople commented 1 year ago

I also encountered a similar bug: https://github.com/triton-inference-server/server/issues/6220