@gioipv Could you please share your `model.py` file? From your overall description, it seems like you are doing it correctly. What inferencing solution does the "model inference" column use? Is it using the PyTorch backend?
How did you measure the performance of your Python models? Did you use Perf Analyzer?
@Tabrizian Thank you for the reply.
I send HTTP requests. For the performance measurement, I take the duration from when I send a request until I receive the response at the client. For throughput: once the server is stable (ignoring the first request), I take the total time of 1000 requests and divide it by 1000.
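A minimal sketch of such a measurement loop (assuming Triton's HTTP Python client and the `INPUT0` name used in `model.py` below; the input shape and dtype are placeholders, not the actual model config):

```python
import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder input; the real shape/dtype must match config.pbtxt.
img = np.zeros((224, 224, 3), dtype=np.uint8)
inputs = [httpclient.InferInput("INPUT0", list(img.shape), "UINT8")]
inputs[0].set_data_from_numpy(img)

client.infer("eye_state_model", inputs)  # warm-up; ignore the first request

n = 1000
start = time.time()
for _ in range(n):
    client.infer("eye_state_model", inputs)
elapsed = time.time() - start

print(f"avg latency: {1000 * elapsed / n:.2f} ms")
print(f"throughput:  {n / elapsed:.1f} infer/sec")
```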
The "model inference" column is my own implementation of the model inference (no server or client involved), not Triton's PyTorch backend. Simply: I take an image, run the pre-processing, pass it directly to my model, then run the post-processing. The steps in the "model inference" column (pre-process, inference, post-process) are the same as the processing in Triton and TorchServe, except that TorchServe and Triton each add some processing of their own (Triton executes it in the `TritonPythonModel` class, TorchServe in the handler).

`model.py`:
```python
import json
import numpy as np
import os
import torch
from torch2trt import TRTModule
from torchvision import transforms
import cv2
import triton_python_backend_utils as pb_utils


class eye_state_model():
    def __init__(self, trained_path, gpu) -> None:
        self.gpu = gpu
        self.trained_path = trained_path
        self.device = self.device_initialize()
        self.model = self.load_model()

    def device_initialize(self):
        return device

    def load_model(self):
        # load the model from the weight file and set it to eval mode
        return model_trt

    def preprocess_img(self, img):
        # implement pre-processing
        return img

    def inference(self, data):
        # model inference here
        return result

    def postprocess(self, data):
        # implement model post-processing
        return predict_idx


class TritonPythonModel:
    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        # You must parse model_config. JSON string is not parsed here
        self.model_config = json.loads(args['model_config'])
        # Get OUTPUT0 configuration
        output0_config = pb_utils.get_output_config_by_name(self.model_config, "OUTPUT0")
        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(output0_config['data_type'])
        # Initialize the model from the trained model path
        trained_model_path = self.model_config["parameters"]['TRAINED_PATH']['string_value']
        print("load model from trained model path:", trained_model_path)
        self.model = eye_state_model(trained_path=trained_model_path, gpu=0)
        print("load model successful\n")

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. `execute`
        receives a list of pb_utils.InferenceRequest as its only argument.
        This function is called when an inference request is made for this
        model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you
        can set the error argument when creating a pb_utils.InferenceResponse.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        output0_dtype = self.output0_dtype
        responses = []
        for request in requests:
            # Get INPUT0
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            img = in_0.as_numpy()
            img = self.model.preprocess_img(img)
            infer = self.model.inference(img)
            idx = self.model.postprocess(infer)
            out_tensor_0 = pb_utils.Tensor("OUTPUT0", idx.astype(output0_dtype))
            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor_0])
            responses.append(inference_response)
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` is OPTIONAL. This function allows
        the model to perform any necessary clean-ups before exit.
        """
        print('Cleaning up...')
```
I think you have shared a blueprint of your model. In the file you've shared, I can't see any issues. You need to make sure that you are not using array slices or NumPy functions that could make a copy of your tensor. Using Perf Analyzer is very easy and would rule out any bugs in your performance measurement methodology: you just need to run `perf_analyzer -m <your_model_name>`. It would be great if you could share the Perf Analyzer numbers for the same baselines.
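To illustrate the view-vs-copy distinction (generic NumPy behavior, not specific to this model):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

basic = a[:, 1:3]            # basic slicing returns a view (no data copy)
fancy = a[:, [1, 2]]         # fancy (list) indexing returns a copy
cast = a.astype(np.float32)  # dtype conversion always allocates a new array

print(basic.base is a)  # True  -> shares memory with `a`
print(fancy.base is a)  # False -> owns its own buffer
```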
Sorry, I just wanted to keep my comment brief. So, this is the full code of my `model.py`:
```python
import json
import numpy as np
import os
import torch
from torch2trt import TRTModule
from torchvision import transforms
import cv2
import triton_python_backend_utils as pb_utils


class resize_img(object):
    def __init__(self, img_size=(224, 224)):
        self.dsize = img_size  # (width, height)

    def __call__(self, img):
        assert type(img).__module__ == 'numpy'
        resized = cv2.resize(img, self.dsize, interpolation=cv2.INTER_AREA)
        return resized


class normalize(object):
    def __init__(self, mode='rgb') -> None:
        self.mode = mode
        self.normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                              std=[0.229, 0.224, 0.225])
        super().__init__()

    def __call__(self, image_tensor):
        assert type(image_tensor).__module__ == 'torch'
        if image_tensor.size(-1) in [3, 1]:
            # Need to convert to (C, H, W): https://pytorch.org/vision/stable/_modules/torchvision/transforms/transforms.html#RandomHorizontalFlip
            image_tensor = image_tensor.permute(2, 0, 1)
        if self.mode == 'rgb':
            assert image_tensor.size(0) == 3
            # Normalize RGB image
            compose = transforms.Compose([self.normalize])
            image_tensor = compose(image_tensor)
            return image_tensor
        elif self.mode == 'gray':
            # assert image_tensor.size(0) == 1
            # Normalize gray image
            image_tensor = image_tensor / 255.
            return image_tensor
        return None


class eye_state_model():
    def __init__(self, trained_path, gpu) -> None:
        self.gpu = gpu
        self.trained_path = trained_path
        self.device = self.device_initialize()
        self.model = self.load_model()
        self.composed = transforms.Compose([
            resize_img(),
            transforms.ToTensor(),
            normalize()])

    def device_initialize(self):
        # device = 'cuda:{}'.format(self.gpu)
        device = 'cuda'
        return device

    def load_model(self):
        model_trt = TRTModule()
        model_trt.load_state_dict(torch.load(self.trained_path, map_location='cpu'))
        model_trt.eval()
        model_trt.to(self.device)
        return model_trt

    def preprocess_img(self, img):
        img = self.composed(img)
        img = img.unsqueeze(0)  # add a batch dimension
        return img

    def inference(self, data):
        data = data.to(self.device)
        result = self.model(data)
        return result

    def postprocess(self, data):
        predict_idx = data.argmax(1).cpu().detach().numpy()[0]
        return predict_idx


class TritonPythonModel:
    """
    Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        # You must parse model_config. JSON string is not parsed here
        self.model_config = json.loads(args['model_config'])
        # Get OUTPUT0 configuration
        output0_config = pb_utils.get_output_config_by_name(self.model_config, "OUTPUT0")
        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(output0_config['data_type'])
        # Initialize the model from the trained model path
        trained_model_path = self.model_config["parameters"]['TRAINED_PATH']['string_value']
        print("load model from trained model path:", trained_model_path)
        self.model = eye_state_model(trained_path=trained_model_path, gpu=0)
        print("load model successful\n")

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. `execute`
        receives a list of pb_utils.InferenceRequest as its only argument.
        This function is called when an inference request is made for this
        model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you
        can set the error argument when creating a pb_utils.InferenceResponse.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        output0_dtype = self.output0_dtype
        responses = []
        for request in requests:
            print('print request:', request)
            # Get INPUT0
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            img = in_0.as_numpy()
            img = self.model.preprocess_img(img)
            infer = self.model.inference(img)
            idx = self.model.postprocess(infer)
            # print("output response:", idx)
            out_tensor_0 = pb_utils.Tensor("OUTPUT0", idx.astype(output0_dtype))
            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor_0])
            responses.append(inference_response)
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` is OPTIONAL. This function allows
        the model to perform any necessary clean-ups before exit.
        """
        print('Cleaning up...')
```
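For completeness, a hypothetical `config.pbtxt` that would match this `model.py` (the `TRAINED_PATH` key mirrors the lookup in `initialize`; the shapes, data types, and path are assumptions, not the author's actual config):

```
name: "eye_state_model"
backend: "python"
max_batch_size: 0

input [
  {
    name: "INPUT0"
    data_type: TYPE_UINT8   # assumed HWC uint8 image input
    dims: [ -1, -1, 3 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_INT64   # assumed single class index
    dims: [ 1 ]
  }
]

parameters: {
  key: "TRAINED_PATH"
  value: { string_value: "/models/eye_state_model/1/model_trt.pth" }  # hypothetical path
}
```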
[Update]:
This is the result of measuring my model's performance on Triton with `perf_analyzer`. I also ran `perf_analyzer` against TorchServe.

Triton:

```
perf_analyzer --service-kind triton -m eye_state_model -v -u http://172.16.19.68:8080 --percentile=95 --input-data /home/gioipv/workspaces/triton_template/client/perf_analyzer/data_test.json
```

TorchServe:

```
perf_analyzer --service-kind torchserve -m eye_state_model --percentile=95 -v -u http://127.0.0.1:4886 --input-data /home/gioipv/workspaces/ekyc_fault_detection/ekyc_eye_state/perf_analyzer/data_test.json
```
The format of the *.json input files differs between the two servers (Triton and TorchServe), but the content is the same.
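For reference, a minimal sketch of what the Triton-side `--input-data` file might look like (the shape and values are placeholders, not the actual `data_test.json`):

```json
{
  "data": [
    {
      "INPUT0": {
        "content": [0, 0, 0],
        "shape": [1, 1, 3]
      }
    }
  ]
}
```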
This is the result:

[perf_analyzer results screenshot]

What do you think about this: Triton's client API can take a long time?

@gioipv Thanks for sharing the results.
The Triton Python client may add some latency because it goes through the Python API, which can be slower. Also, if you want to try concurrency values higher than 1, it would be harder to create the same scenario using the Python client.
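As a side note on concurrency, `perf_analyzer` can sweep concurrency levels itself; a hypothetical invocation reusing the model name above:

```
perf_analyzer -m eye_state_model --concurrency-range 1:4 --percentile=95
```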
**Description**
Hello. The throughput of Triton is lower than the throughput of TorchServe, even though I used a TensorRT model in Triton (TensorRT model inference is faster than PyTorch model inference).

**Triton Information**
Version: Triton 21.07

**To Reproduce**
This is what I did: I created a `model.py` file, added the weights of the trained TensorRT model, created the `config.pbtxt` file, built the stub and conda-packed my envs, ... In `model.py`, in the `TritonPythonModel` class: in the `initialize` function I load the model; in the `execute` function I implement my model inference and convert the output into the format for the Triton response.

**Expected behavior**
Am I doing it right? If I'm wrong, can you point me in the right direction?