triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Performance on triton with python backend #3274

Closed. gioipv closed this issue 3 years ago.

gioipv commented 3 years ago

Description

Hello,

[Screenshot from 2021-08-25 18-29-01: performance comparison]

Triton Information

Version: Triton 21.07

To Reproduce

This is what I did:

Expected behavior

Am I doing this right? If I'm wrong, can you point me in the right direction?

Tabrizian commented 3 years ago

@gioipv Could you please share your model.py file? From your overall description, it seems like you are doing it correctly. What inferencing solution does the "model inference" column use? Is it using the PyTorch backend?

How did you measure the performance of your Python models? Did you use Perf Analyzer?
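For context, a bare "model inference" number is usually taken with a stand-alone timing loop like the hypothetical sketch below (the thread does not say how gioipv measured it). Such a loop excludes Triton, the network transport, and pre/post-processing, so it is expected to be faster than an end-to-end Triton request.

import time
import torch

def time_model(model, example_input, iters=100, warmup=10):
    # Hypothetical stand-alone timing of the bare TensorRT/PyTorch model,
    # excluding Triton, HTTP transport, and pre/post-processing.
    with torch.no_grad():
        for _ in range(warmup):
            model(example_input)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(example_input)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000  # ms per inference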

gioipv commented 3 years ago

@Tabrizian Thank you for the reply.

import json
import os
import torch
from torch2trt import TRTModule
from torchvision import transforms
import cv2
import triton_python_backend_utils as pb_utils

class eye_state_model():

    def __init__(self, trained_path, gpu) -> None:
        self.gpu = gpu
        self.trained_path = trained_path
        self.device = self.device_initialize()
        self.model = self.load_model()

    def device_initialize(self):
        # implement device init
        return device

    def load_model(self):
        # implement model loading: load the model from the weight file and set it to eval mode
        return model_trt

    def preprocess_img(self, img):
        # implement pre-processing
        return img

    def inference(self, data):
        # model inference here
        return result

    def postprocess(self, data):
        # implement model post-processing
        return predict_idx

class TritonPythonModel:
    """
    Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """
    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing the `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.
        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        # You must parse model_config. JSON string is not parsed here
        self.model_config = json.loads(args['model_config'])
        # Get OUTPUT0 configuration
        output0_config = pb_utils.get_output_config_by_name(self.model_config, "OUTPUT0")
        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(output0_config['data_type'])

        # DOING: initialize model from trained model path
        trained_model_path = self.model_config["parameters"]['TRAINED_PATH']['string_value']
        print("load model from trained model path:", trained_model_path)
        self.model = eye_state_model(trained_path=trained_model_path, gpu=0)
        print("load model successful\n")

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. The `execute`
        function receives a list of pb_utils.InferenceRequest as its only
        argument. This function is called when an inference request is made
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse.
        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest
        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        output0_dtype = self.output0_dtype
        responses = []
        for request in requests:
            # print('print request:', request)
            # Get INPUT0
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            img = in_0.as_numpy()
            img = self.model.preprocess_img(img)
            infer = self.model.inference(img)
            idx = self.model.postprocess(infer)
            # print("output response:", idx)
            out_tensor_0 = pb_utils.Tensor("OUTPUT0", idx.astype(output0_dtype))
            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor_0])
            responses.append(inference_response)
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing the `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean-ups before exit.
        """
        print('Cleaning up...')

Tabrizian commented 3 years ago

I think you have shared a blueprint of your model. In the file you've shared I can't see any issues. You need to make sure that you are not using array slices or NumPy functions that could make a copy of your tensor. Using Perf Analyzer is very easy and makes sure there are no bugs in your performance measurement method. You just need to run perf_analyzer -m <your_model_name>. It would be great if you could share the Perf Analyzer numbers for the same baselines.
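As a side note (not from the original thread), here is a small NumPy/PyTorch sketch of which common operations copy the underlying buffer and which only create a view, since unintended copies are the usual source of this kind of overhead:

import numpy as np
import torch

img = np.zeros((224, 224, 3), dtype=np.float32)

view = img[10:100, 10:100]                            # basic slicing: returns a view, no copy
copy1 = img.astype(np.float64)                        # dtype conversion: always a copy
copy2 = img[[0, 5, 7]]                                # fancy indexing: a copy
copy3 = np.ascontiguousarray(img.transpose(2, 0, 1))  # changing memory layout: a copy

# torch.from_numpy() shares memory with the NumPy array (zero-copy),
# while torch.tensor(...) always allocates a new buffer.
shared = torch.from_numpy(img)
copied = torch.tensor(img)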

gioipv commented 3 years ago

Sorry, I just wanted to keep my comment brief. This is the full code of my model.py:

import json
import numpy as np

import os
import torch
from torch2trt import TRTModule
from torchvision import transforms
import cv2
import triton_python_backend_utils as pb_utils

class resize_img(object):
    def __init__(self, img_size=(224, 224)):
        self.dsize = img_size # (width, height)
    def __call__(self, img):
        assert type(img).__module__ == 'numpy'
        resized = cv2.resize(img, self.dsize, interpolation = cv2.INTER_AREA)
        return resized

class normalize(object):

    def __init__(self, mode='rgb') -> None:
        self.mode = mode
        self.normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                              std=[0.229, 0.224, 0.225])
        super().__init__()

    def __call__(self, image_tensor):
        assert type(image_tensor).__module__ == 'torch'
        if image_tensor.size(-1) in [3, 1]:
            # Need convert to (C,H,W) : https://pytorch.org/vision/stable/_modules/torchvision/transforms/transforms.html#RandomHorizontalFlip
            image_tensor = image_tensor.permute(2, 0, 1)
        if self.mode == 'rgb':
            assert image_tensor.size(0) == 3
            #Normalize RGB img
            compose = transforms.Compose([self.normalize])
            image_tensor = compose(image_tensor)
            return image_tensor
        elif self.mode == 'gray':
        # assert image_tensor.size(0) == 1:
            # Normalize gray img
            image_tensor = image_tensor/255.
            return image_tensor
        return None

class eye_state_model():

    def __init__(self, trained_path, gpu) -> None:
        self.gpu = gpu
        self.trained_path = trained_path
        self.device = self.device_initialize()
        self.model = self.load_model()
        self.composed = transforms.Compose([
                                            resize_img(),
                                            transforms.ToTensor(),
                                            normalize()])

    def device_initialize(self):
        # device = 'cuda:{}'.format(self.gpu)
        device = 'cuda'
        return device

    def load_model(self):
        model_trt = TRTModule()
        model_trt.load_state_dict(torch.load(self.trained_path, map_location='cpu'))
        model_trt.eval()
        model_trt.to(self.device)
        return model_trt

    def preprocess_img(self, img):
        img = self.composed(img)
        img = img.unsqueeze(0)  # add a batch dimension (N, C, H, W)
        return img

    def inference(self, data):
        data = data.to(self.device)
        result = self.model(data)
        return result

    def postprocess(self, data):
        predict_idx = data.argmax(1).cpu().detach().numpy()[0]
        return predict_idx

class TritonPythonModel:
    """
    Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """
    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.
        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        # You must parse model_config. JSON string is not parsed here
        self.model_config = json.loads(args['model_config'])
        # Get OUTPUT0 configuration
        output0_config = pb_utils.get_output_config_by_name(self.model_config, "OUTPUT0")
        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(output0_config['data_type'])

        # DOING: initialize model from trained model path
        trained_model_path = self.model_config["parameters"]['TRAINED_PATH']['string_value']
        print("load model from trained model path:", trained_model_path)
        self.model = eye_state_model(trained_path=trained_model_path, gpu=0)
        print("load model successful\n")

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference request is made
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse
        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest
        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        output0_dtype = self.output0_dtype
        responses = []
        for request in requests:
            print('print request:', request)
            # Get INPUT0
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            img = in_0.as_numpy()
            img = self.model.preprocess_img(img)
            infer = self.model.inference(img)
            idx = self.model.postprocess(infer)
            # print("output response:", idx)
            out_tensor_0 = pb_utils.Tensor("OUTPUT0", idx.astype(output0_dtype))
            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor_0])
            responses.append(inference_response)
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')
gioipv commented 3 years ago

[Update]: These are the results of measuring my model's performance on Triton with perf_analyzer. I also ran perf_analyzer against TorchServe.

This is the result:

[Screenshot from 2021-08-27 15-09-48: perf_analyzer results]

Tabrizian commented 3 years ago

@gioipv Thanks for sharing the results.

Regarding your question about whether Triton's client API can take a long time:

The Triton Python client may add some latency because the Python API can be slower. Also, if you want to try concurrency values higher than 1, it would be harder to create the same scenario using the Python client.
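For completeness, a minimal sketch (not from this thread) of timing a single request through the Triton Python HTTP client; the model name, input/output names, shape, and datatype are assumptions based on the model.py above and would need to match the actual config.pbtxt:

import time

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Assumed input: one 224x224 RGB image, matching the preprocessing in model.py.
img = np.random.randint(0, 255, (224, 224, 3)).astype(np.uint8)

inp = httpclient.InferInput("INPUT0", list(img.shape), "UINT8")
inp.set_data_from_numpy(img)
out = httpclient.InferRequestedOutput("OUTPUT0")

start = time.perf_counter()
result = client.infer(model_name="eye_state_model", inputs=[inp], outputs=[out])
latency_ms = (time.perf_counter() - start) * 1000

print("OUTPUT0:", result.as_numpy("OUTPUT0"))
print("client-side latency: %.2f ms" % latency_ms)

The time measured this way includes Python-side serialization and the HTTP round trip, which is exactly the overhead perf_analyzer lets you separate out, since it reports client send/receive time and server compute time independently.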