Triton gives wrong output

Description

The problem with the inference of the classifier model (HRNet W30) via the Triton Inference Server (of just Triton). The model was trained for 2 classes using MMPretrain (PyTorch underneath is used) framework. Then the weights are converted using the deploy.py script from MMDeploy framework into TensorRT format and were running inside Triton. Model weights worked correctly before conversion. After converting the weights into TensorRT format, the correct classification is also given through the test script from MMPretrain github - image_demo.py.

But when I try to get predicts via Triton server (using asynchronous requests), the model always returns a constant predicts. Its result through triton is always 1 confidence for the first class and 0 for the second.

To inference the Triton server, the original code from Nvidia's examples is used - simple_grpc_async_infer_client.py, just upgraded to operate with image. (see “The modified inference script”).

Triton Information I’m using Triton inside Docker container with base image – nvcr.io/nvidia/tritonserver:23.02-py3

Expected behavior The model should produce correct predictions for classes (at least non-constant)

Triton configs

Classifier config:

name: "repairs_cls_hrnetw30_trt_dynamic_384x512"
platform: "tensorrt_plan"
max_batch_size : 20

input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [3, 512, 384 ]
    reshape { shape: [3, 512, 384 ] }
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [2]
  }
]
instance_group [
  {
    count: 4
    kind: KIND_GPU
  }
]

Preprocessing config:

name: "repairs_cls_hrnetw30_preprocess_384x512"
backend: "dali"
max_batch_size: 256

input [
  {
    name: "INPUT_PREPROCESS"
    data_type: TYPE_UINT8
    dims: [-1, -1, 3]
  }
]
output [
  {
    name: "OUTPUT_PREPROCESS"
    data_type: TYPE_FP32
    dims: [3, 512, 384 ]
    reshape { shape: [3, 512, 384 ] }
  }
]

instance_group [
  {
    count: 4
    gpus: [0]
    kind: KIND_GPU
  }
]

Ensemble:

name: "repairs_cls_ensemble"
platform: "ensemble"
max_batch_size: 1

input [
  {
    name: "input"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3]
  }
]

output [
  {
    name: "classes"
    data_type: TYPE_FP32
    dims: [2]
  }
]

ensemble_scheduling {
    step [
        {
            model_name: "repairs_cls_hrnetw30_preprocess_384x512"
            model_version: -1
            input_map {
                key: "INPUT_PREPROCESS"
                value: "input"
            }
            output_map {
                key: "OUTPUT_PREPROCESS"
                value: "preprocessed_image"
            }
        },

        {
            model_name: "repairs_cls_hrnetw30_trt_dynamic_384x512"
            model_version: -1
            input_map {
                key: "input"
                value: "preprocessed_image"
            }
            output_map [
                {
                    key: "output"
                    value: "classes"
                }
            ]
        }
    ]
}

Preprocessing DALI pipeline

import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def
from nvidia.dali.types import FLOAT

@pipeline_def(batch_size=256, num_threads=4, device_id=0)
def hrnet_w30_cls_preprocess_pipeline():
    """
    Prepares pipeline of operations.
    :param batch_size: size of maximum batch for new DALI model.
    :param num_threads: number of CPU threads to be used.
    :param device_id: ID of GPU.
    :return: preprocessed images.
    """
    device = "gpu"
    images = fn.external_source(device=device, name="INPUT_PREPROCESS")
    images = fn.resize(images, size=(512, 384), mode="not_larger", device=device)
    images = fn.crop(images, crop=(512, 384), out_of_bounds_policy="pad", device=device)
    images = fn.cast(images, dtype=FLOAT, device=device)

    return images

Docker images Image used for conversion:

FROM nvcr.io/nvidia/tensorrt:23.02-py3

# Install dependencies
RUN apt-get update -y && DEBIAN_FRONTEND=noninteractive && \
    apt-get install -y --no-install-recommends \
    ffmpeg\
    libsm6\
    libxext6 \
    unixodbc-dev \
    python3-dev \
    python3-pip \
    python3-opencv \
    python3-psycopg2 \
    python3-setuptools \
    python3.8-dev \
    ca-certificates \
    git \
    wget \
    sudo \
    cmake \
    ninja-build \
    libgl1-mesa-dev \
    libgtk2.0-dev \
    gcc \
    g++ \
    unixodbc-dev \
    odbc-postgresql \
    build-essential  \
    libboost-python-dev  \
    libboost-thread-dev \
    openssh-server

RUN python3 -m pip install --upgrade pip

ARG TORCH_VERSION=1.13.1
ARG TORCHVISION_VERSION=0.14.1

# Install DALI - used for getting preprocessing DALI model.
RUN pip3 install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda120

# Install torch
RUN pip3 install torch==${TORCH_VERSION} torchvision==${TORCHVISION_VERSION} \
    --extra-index-url https://download.pytorch.org/whl/cu116

RUN pip3 install segmentation-models-pytorch==0.3.3

COPY converter/docker /tmp/docker
RUN pip3 install -r /tmp/docker/pip_requirements.txt
RUN mim install -r /tmp/docker/mim_requirements.txt

# PASSWORD
ENV USER=user
ENV PASSWORD=password

# Create directory
WORKDIR /home/"${USER}"/converter
COPY converter/dali dali
COPY converter/model_convert model_convert
COPY converter/src src
COPY converter/trt_infer trt_infer

RUN useradd -ms /bin/bash "${USER}" \
    && echo "${USER}:${PASSWORD}" | chpasswd \
    && adduser "${USER}" sudo \
    && sudo chmod -R 777 /home/"${USER}"/converter

ENV PYTHONPATH "${PYTHONPATH}:/home/${USER}/converter"
ENV PYTHONPATH "${PYTHONPATH}:/home/${USER}/converter/src"

RUN sudo chmod -R 777 /home

EXPOSE 22
CMD bash -c "service ssh start && while true; do sleep 30; done;"

The modified inference script

#!/usr/bin/env python
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import argparse
import os
import sys
import time
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient
from mmcv import imread
from tritonclient.utils import InferenceServerException

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        required=False,
        default=False,
        help="Enable verbose output",
    )
    parser.add_argument(
        "-u",
        "--url",
        type=str,
        required=False,
        default="192.168.0.207:30036",
        help="Inference server URL. Default is localhost:8001.",
    )
    parser.add_argument(
        "-t",
        "--client-timeout",
        type=float,
        required=False,
        default=None,
        help="Client timeout in seconds. Default is None.",
    )
    parser.add_argument(
        "-i",
        "--img",
        type=str,
        required=False,  
        default="_resources/img_1.png",
        help="Path to image",
    )
    # WARNING! Works only for single image

    FLAGS = parser.parse_args()
    try:
        triton_client = grpcclient.InferenceServerClient(
            url=FLAGS.url, verbose=FLAGS.verbose
        )
    except Exception as e:
        print("context creation failed: " + str(e))
        sys.exit()

    model_name = "repairs_cls_ensemble"

    img_list = []
    if os.path.isfile(FLAGS.img):
        img_list.append(FLAGS.img)
    else:
        for img_name in os.listdir(FLAGS.img):
            if img_name[-4:].lower() in ('.jpg', 'jpeg', '.png', 'bmp'):
                img_list.append(os.path.join(FLAGS.img, img_name))

    # Infer
    inputs = []
    outputs = []
    user_data = []

    # Define the callback function. Note the last two parameters should be
    # result and error. InferenceServerClient would povide the results of an
    # inference as grpcclient.InferResult in result. For successful
    # inference, error will be None, otherwise it will be an object of
    # tritonclientutils.InferenceServerException holding the error details

    def callback(user_data, result, error):
        if error:
            user_data.append(error)
        else:
            user_data.append(result)

    # Initialize the data
    frame = imread(img_list[0])

    frame = np.expand_dims(frame, axis=0)
    inputs.append(grpcclient.InferInput("input", frame.shape, "UINT8"))
    outputs.append(grpcclient.InferRequestedOutput("classes", class_count=2))
    inputs[0].set_data_from_numpy(frame)

    # Inference call
    triton_client.async_infer(
        model_name=model_name,
        inputs=inputs,
        callback=partial(callback, user_data),
        outputs=outputs,
        client_timeout=FLAGS.client_timeout,
    )

    # Wait until the results are available in user_data
    time_out = 10
    while (len(user_data) == 0) and time_out > 0:
        time_out = time_out - 1
        time.sleep(1)

    # Display and validate the available results
    if len(user_data) == 1:
        # Check for the errors
        if type(user_data) == InferenceServerException:
            print(user_data)
            sys.exit(1)

        output_data = user_data[0].as_numpy("classes")

        print(output_data)

triton-inference-server / server

Triton gives wrong output #7631