triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Performance on triton with python backend #3274

Closed. gioipv closed this issue 3 years ago.

gioipv commented 3 years ago

Description

Hello,

[Screenshot from 2021-08-25 18-29-01: performance comparison]

Triton Information

Version: Triton 21.07

To Reproduce

This is what I did:

Expected behavior

Am I doing this right? If I'm wrong, can you point me in the right direction?

Tabrizian commented 3 years ago

@gioipv Could you please share your model.py file? From your overall description, it seems like you are doing it correctly. What inferencing solution does the "model inference" column use? Is it using the PyTorch backend?

How did you measure the performance of your Python models? Did you use Perf Analyzer?
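For context, a bare "model inference" number is usually taken with a stand-alone timing loop like the hypothetical sketch below (the thread does not say how gioipv measured it). Such a loop excludes Triton, the network transport, and pre/post-processing, so it is expected to be faster than an end-to-end Triton request.

import time
import torch

def time_model(model, example_input, iters=100, warmup=10):
    # Hypothetical stand-alone timing of the bare TensorRT/PyTorch model,
    # excluding Triton, HTTP transport, and pre/post-processing.
    with torch.no_grad():
        for _ in range(warmup):
            model(example_input)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(example_input)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000  # ms per inference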

gioipv commented 3 years ago

@Tabrizian Thank you for the reply.

import json
import os
import torch
from torch2trt import TRTModule
from torchvision import transforms
import cv2
import triton_python_backend_utils as pb_utils

class eye_state_model():

    def __init__(self, trained_path, gpu) -> None:
        self.gpu = gpu
        self.trained_path = trained_path
        self.device = self.device_initialize()
        self.model = self.load_model()

    def device_initialize(self):
        # implement device init
        return device

    def load_model(self):
        # implement model loading: load the model from the weight file and set it to eval mode
        return model_trt

    def preprocess_img(self, img):
        # implement pre-processing
        return img

    def inference(self, data):
        # model inference here
        return result

    def postprocess(self, data):
        # implement model post-processing
        return predict_idx

class TritonPythonModel:
    """
    Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """
    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing the `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.
        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        # You must parse model_config. JSON string is not parsed here
        self.model_config = json.loads(args['model_config'])
        # Get OUTPUT0 configuration
        output0_config = pb_utils.get_output_config_by_name(self.model_config, "OUTPUT0")
        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(output0_config['data_type'])

        # DOING: initialize model from trained model path
        trained_model_path = self.model_config["parameters"]['TRAINED_PATH']['string_value']
        print("load model from trained model path:", trained_model_path)
        self.model = eye_state_model(trained_path=trained_model_path, gpu=0)
        print("load model successful\n")

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. The `execute`
        function receives a list of pb_utils.InferenceRequest as its only
        argument. This function is called when an inference request is made
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse.
        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest
        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        output0_dtype = self.output0_dtype
        responses = []
        for request in requests:
            # print('print request:', request)
            # Get INPUT0
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            img = in_0.as_numpy()
            img = self.model.preprocess_img(img)
            infer = self.model.inference(img)
            idx = self.model.postprocess(infer)
            # print("output response:", idx)
            out_tensor_0 = pb_utils.Tensor("OUTPUT0", idx.astype(output0_dtype))
            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor_0])
            responses.append(inference_response)
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing the `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean-ups before exit.
        """
        print('Cleaning up...')

Tabrizian commented 3 years ago

I think you have shared a blueprint of your model. In the file you've shared I can't see any issues. You need to make sure that you are not using array slices or NumPy functions that could make a copy of your tensor. Using Perf Analyzer is very easy and makes sure there are no bugs in your performance measurement method. You just need to run perf_analyzer -m <your_model_name>. It would be great if you could share the Perf Analyzer numbers for the same baselines.
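As a side note (not from the original thread), here is a small NumPy/PyTorch sketch of which common operations copy the underlying buffer and which only create a view, since unintended copies are the usual source of this kind of overhead:

import numpy as np
import torch

img = np.zeros((224, 224, 3), dtype=np.float32)

view = img[10:100, 10:100]                            # basic slicing: returns a view, no copy
copy1 = img.astype(np.float64)                        # dtype conversion: always a copy
copy2 = img[[0, 5, 7]]                                # fancy indexing: a copy
copy3 = np.ascontiguousarray(img.transpose(2, 0, 1))  # changing memory layout: a copy

# torch.from_numpy() shares memory with the NumPy array (zero-copy),
# while torch.tensor(...) always allocates a new buffer.
shared = torch.from_numpy(img)
copied = torch.tensor(img)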

gioipv commented 3 years ago

Sorry, I just wanted to keep my comment brief. This is the full code of my model.py:

import json
import numpy as np

import os
import torch
from torch2trt import TRTModule
from torchvision import transforms
import cv2
import triton_python_backend_utils as pb_utils

class resize_img(object):
    def __init__(self, img_size=(224, 224)):
        self.dsize = img_size # (width, height)
    def __call__(self, img):
        assert type(img).__module__ == 'numpy'
        resized = cv2.resize(img, self.dsize, interpolation = cv2.INTER_AREA)
        return resized

class normalize(object):

    def __init__(self, mode='rgb') -> None:
        self.mode = mode
        self.normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                              std=[0.229, 0.224, 0.225])
        super().__init__()

    def __call__(self, image_tensor):
        assert type(image_tensor).__module__ == 'torch'
        if image_tensor.size(-1) in [3, 1]:
            # Need convert to (C,H,W) : https://pytorch.org/vision/stable/_modules/torchvision/transforms/transforms.html#RandomHorizontalFlip
            image_tensor = image_tensor.permute(2, 0, 1)
        if self.mode == 'rgb':
            assert image_tensor.size(0) == 3
            #Normalize RGB img
            compose = transforms.Compose([self.normalize])
            image_tensor = compose(image_tensor)
            return image_tensor
        elif self.mode == 'gray':
        # assert image_tensor.size(0) == 1:
            # Normalize gray img
            image_tensor = image_tensor/255.
            return image_tensor
        return None

class eye_state_model():

    def __init__(self, trained_path, gpu) -> None:
        self.gpu = gpu
        self.trained_path = trained_path
        self.device = self.device_initialize()
        self.model = self.load_model()
        self.composed = transforms.Compose([
                                            resize_img(),
                                            transforms.ToTensor(),
                                            normalize()])

    def device_initialize(self):
        # device = 'cuda:{}'.format(self.gpu)
        device = 'cuda'
        return device

    def load_model(self):
        model_trt = TRTModule()
        model_trt.load_state_dict(torch.load(self.trained_path, map_location='cpu'))
        model_trt.eval()
        model_trt.to(self.device)
        return model_trt

    def preprocess_img(self, img):
        img = self.composed(img)
        img = img.unsqueeze(0)  # add a batch dimension (N, C, H, W)
        return img

    def inference(self, data):
        data = data.to(self.device)
        result = self.model(data)
        return result

    def postprocess(self, data):
        predict_idx = data.argmax(1).cpu().detach().numpy()[0]
        return predict_idx

class TritonPythonModel:
    """
    Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """
    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.
        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        # You must parse model_config. JSON string is not parsed here
        self.model_config = json.loads(args['model_config'])
        # Get OUTPUT0 configuration
        output0_config = pb_utils.get_output_config_by_name(self.model_config, "OUTPUT0")
        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(output0_config['data_type'])

        # DOING: initialize model from trained model path
        trained_model_path = self.model_config["parameters"]['TRAINED_PATH']['string_value']
        print("load model from trained model path:", trained_model_path)
        self.model = eye_state_model(trained_path=trained_model_path, gpu=0)
        print("load model successful\n")

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference request is made
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse
        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest
        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        output0_dtype = self.output0_dtype
        responses = []
        for request in requests:
            print('print request:', request)
            # Get INPUT0
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            img = in_0.as_numpy()
            img = self.model.preprocess_img(img)
            infer = self.model.inference(img)
            idx = self.model.postprocess(infer)
            # print("output response:", idx)
            out_tensor_0 = pb_utils.Tensor("OUTPUT0", idx.astype(output0_dtype))
            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor_0])
            responses.append(inference_response)
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')
gioipv commented 3 years ago

[Update]: These are the results of measuring my model's performance on Triton with perf_analyzer. I also ran perf_analyzer against TorchServe.

This is the result:

[Screenshot from 2021-08-27 15-09-48: perf_analyzer results]

Tabrizian commented 3 years ago

@gioipv Thanks for sharing the results.

Regarding your question about whether Triton's client API can take a long time:

The Triton Python client may add some latency because the Python API can be slower. Also, if you want to try concurrency values higher than 1, it would be harder to create the same scenario using the Python client.
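For completeness, a minimal sketch (not from this thread) of timing a single request through the Triton Python HTTP client; the model name, input/output names, shape, and datatype are assumptions based on the model.py above and would need to match the actual config.pbtxt:

import time

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Assumed input: one 224x224 RGB image, matching the preprocessing in model.py.
img = np.random.randint(0, 255, (224, 224, 3)).astype(np.uint8)

inp = httpclient.InferInput("INPUT0", list(img.shape), "UINT8")
inp.set_data_from_numpy(img)
out = httpclient.InferRequestedOutput("OUTPUT0")

start = time.perf_counter()
result = client.infer(model_name="eye_state_model", inputs=[inp], outputs=[out])
latency_ms = (time.perf_counter() - start) * 1000

print("OUTPUT0:", result.as_numpy("OUTPUT0"))
print("client-side latency: %.2f ms" % latency_ms)

The time measured this way includes Python-side serialization and the HTTP round trip, which is exactly the overhead perf_analyzer lets you separate out, since it reports client send/receive time and server compute time independently.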