triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

How to configure a BLS model to instantiate multiple BLS model instances #5874

Closed DequanZhu closed 1 year ago

DequanZhu commented 1 year ago

I noticed that in the Triton python_backend examples, no model's config.pbtxt file has a configuration like the one below:

  instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
  ]

which is used in other Triton backends like tensorrt_backend and dali_backend. So how should I configure my model to get multiple Python BLS model instances and improve throughput?

I ran a test and found that having only one BLS instance significantly decreases the system's throughput. In this test, I have a BLS model that consists of two stages: a detection stage and a classification stage. In the detection stage, the BLS model executes an inference request against a detection model; then, depending on the number of detection boxes, it executes inference requests against a classifier model. If the detection stage produces no boxes, no request is sent to the classifier. On the client side, I use the tritonclient.http.aio client to post many concurrent inference requests to the BLS model (roughly as sketched below), and the measured throughput is 80. The test input is randomly generated, so the first BLS stage produces no detection boxes and nothing is sent to the second stage; in this case the BLS model is equivalent to the detection stage alone. I then used the same client to post the same concurrent requests directly to the detection model and found that the throughput is nearly 150. What causes this difference in throughput? Is it the Python GIL, and if so, how can I work around it?
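
For reference, the client-side load test looks roughly like this (a minimal sketch: the model name "face-pipeline-bls" and the input shape/datatype are placeholders for my real setup; only the concurrency pattern matters here):

import asyncio

import numpy as np
import tritonclient.http.aio as aioclient
from tritonclient.http import InferInput


async def main():
    client = aioclient.InferenceServerClient(url="localhost:8000")

    async def one_request():
        # Placeholder input: random data shaped like an image tensor.
        data = np.random.randint(0, 255, size=(3, 640, 640)).astype(np.uint8)
        inp = InferInput("orig_image_bytes", list(data.shape), "UINT8")
        inp.set_data_from_numpy(data)
        return await client.infer(model_name="face-pipeline-bls", inputs=[inp])

    # Fire many requests concurrently to measure throughput.
    await asyncio.gather(*[one_request() for _ in range(64)])
    await client.close()


asyncio.run(main())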

DequanZhu commented 1 year ago

I tried to use a config like this for my Python model to get multiple instances:

  instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
  ]

No errors were produced when Triton loaded it, but when I posted an inference request using the Triton client, an error occurred: "Stub process is unhealthy and it will be restarted." When I change the config to just 1 instance:

  instance_group [
    {
      count: 1
      kind: KIND_GPU
    }
  ]

no errors occur when Triton loads the model and executes inference requests. My Triton container is r22.12 and is running in Kubernetes; the Docker shm-size is 4g. My Python model is below:

import asyncio
import triton_python_backend_utils as pb_utils
import json
import numpy as np
import torch
from torch.utils.dlpack import from_dlpack
from skimage import transform as trans

FACE_DETECTOR_MODEL_NAME = "ensemble-yolov7-face-detector"
FACE_FEATURE_MODEL_NAME = "ensemble-align-arcface"

arcface_src = np.array(
    [
        [38.2946, 51.6963],
        [73.5318, 51.5014],
        [56.0252, 71.7366],
        [41.5493, 92.3655],
        [70.7299, 92.2041],
    ],
    dtype=np.float32,
)

class TritonPythonModel:
    def initialize(self, args):
        self.model_config = json.loads(args["model_config"])

    @staticmethod
    def estimate_norm(lmk):
        tform = trans.SimilarityTransform()
        tform.estimate(lmk, arcface_src)
        M = tform.params[0:2, :]
        return M

    # Stage 1: run the face detector model via a BLS inference request.
    async def detector_stage(self, input_data):
        detector_infer_request = pb_utils.InferenceRequest(
            model_name=FACE_DETECTOR_MODEL_NAME,
            requested_output_names=[
                "letterboxed_image",
                "num_dets",
                "det_boxes",
                "det_scores",
                "det_classes",
                "det_lmks",
                "det_lmks_mask",
                "shift",
                "scale_ratio",
            ],
            inputs=[input_data],
        )

        detector_inference_response_await = detector_infer_request.async_exec()
        detector_inference_response = await detector_inference_response_await

        if detector_inference_response.has_error():
            raise pb_utils.TritonModelException(
                detector_inference_response.error().message()
            )
        letterboxed_image_output_tensor = pb_utils.get_output_tensor_by_name(
            detector_inference_response, "letterboxed_image"
        )
        num_dets_output_tensor = pb_utils.get_output_tensor_by_name(
            detector_inference_response, "num_dets"
        )

        det_boxes_output_tensor = pb_utils.get_output_tensor_by_name(
            detector_inference_response, "det_boxes"
        )

        det_scores_output_tensor = pb_utils.get_output_tensor_by_name(
            detector_inference_response, "det_scores"
        )
        det_classes_output_tensor = pb_utils.get_output_tensor_by_name(
            detector_inference_response, "det_classes"
        )
        det_lmks_output_tensor = pb_utils.get_output_tensor_by_name(
            detector_inference_response, "det_lmks"
        )
        det_lmks_mask_output_tensor = pb_utils.get_output_tensor_by_name(
            detector_inference_response, "det_lmks_mask"
        )
        shift_output_tensor = pb_utils.get_output_tensor_by_name(
            detector_inference_response, "shift"
        )

        scale_ratio_output_tensor = pb_utils.get_output_tensor_by_name(
            detector_inference_response, "scale_ratio"
        )

        return (
            letterboxed_image_output_tensor,
            num_dets_output_tensor,
            det_boxes_output_tensor,
            det_scores_output_tensor,
            det_classes_output_tensor,
            det_lmks_output_tensor,
            det_lmks_mask_output_tensor,
            shift_output_tensor,
            scale_ratio_output_tensor,
        )

    # For each request: run detection, then fan out one face-feature request per detection.
    async def execute(self, requests):
        responses = []
        for request in requests:
            orig_image_bytes = pb_utils.get_input_tensor_by_name(
                request, "orig_image_bytes"
            )
            (
                letterboxed_image_output_tensor,
                num_dets_output_tensor,
                det_boxes_output_tensor,
                det_scores_output_tensor,
                det_classes_output_tensor,
                det_lmks_output_tensor,
                det_lmks_mask_output_tensor,
                shift_output_tensor,
                scale_ratio_output_tensor,
            ) = await self.detector_stage(orig_image_bytes)

            num_det = (
                from_dlpack(num_dets_output_tensor.to_dlpack()).cpu().numpy()[0][0]
            )

            det_lmks_output_np = (
                from_dlpack(det_lmks_output_tensor.to_dlpack()).cpu().numpy()[0]
            )

            face_feature_inference_response_awaits = []

            # Issue one async BLS request to the face-feature model per detected face.
            for i in range(num_det):
                letterboxed_image_input_tensor = letterboxed_image_output_tensor
                det_lmk = det_lmks_output_np[i]
                det_lmk = det_lmk.reshape((-1, 2))
                warp_m = self.estimate_norm(det_lmk)
                warp_m = warp_m[np.newaxis, :]
                warp_m_input_tensor = pb_utils.Tensor(
                    "warp_m", warp_m.astype(np.float32)
                )
                warp_infer_request = pb_utils.InferenceRequest(
                    model_name=FACE_FEATURE_MODEL_NAME,
                    requested_output_names=["683"],
                    inputs=[letterboxed_image_input_tensor, warp_m_input_tensor],
                )
                face_feature_inference_response_awaits.append(
                    warp_infer_request.async_exec()
                )

            # Await all face-feature requests concurrently.
            face_feature_inference_responses = await asyncio.gather(
                *face_feature_inference_response_awaits
            )
            for face_feature_infer_response in face_feature_inference_responses:
                if face_feature_infer_response.has_error():
                    raise pb_utils.TritonModelException(
                        face_feature_infer_response.error().message()
                    )
            face_feature_output_tensor_list = []
            for face_feature_infer_response in face_feature_inference_responses:
                align_face_output_tensor = pb_utils.get_output_tensor_by_name(
                    face_feature_infer_response, "683"
                )
                align_face_output_torch_tensor = from_dlpack(
                    align_face_output_tensor.to_dlpack()
                )
                face_feature_output_tensor_list.append(align_face_output_torch_tensor)            
            if face_feature_output_tensor_list:
                face_feature_output_tensors = pb_utils.Tensor(
                    "face_feature",
                    torch.vstack(face_feature_output_tensor_list).cpu().numpy().astype(np.float32),
                )
            else:
                face_feature_output_tensors = pb_utils.Tensor(
                    "face_feature",
                    np.asarray([]).astype(np.float32),
                )
            pipeline_inference_response = pb_utils.InferenceResponse(
                output_tensors=[
                    num_dets_output_tensor,
                    det_boxes_output_tensor,
                    det_scores_output_tensor,
                    det_classes_output_tensor,
                    det_lmks_output_tensor,
                    det_lmks_mask_output_tensor,                    
                    shift_output_tensor,
                    scale_ratio_output_tensor,
                    face_feature_output_tensors,
                ]
            )
            responses.append(pipeline_inference_response)
        return responses

So what caused this error? Is this the right way to get multiple Python model instances?
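
For completeness, the config.pbtxt for this Python model looks roughly like this (a sketch: the datatypes and dims are abbreviated guesses, and only the instance_group block is the part in question):

  name: "face-pipeline-bls"
  backend: "python"
  max_batch_size: 0

  input [
    {
      name: "orig_image_bytes"
      data_type: TYPE_UINT8
      dims: [ -1 ]
    }
  ]

  output [
    {
      name: "face_feature"
      data_type: TYPE_FP32
      dims: [ -1, -1 ]
    }
    # ... the detector passthrough outputs (num_dets, det_boxes, det_scores,
    # det_classes, det_lmks, det_lmks_mask, shift, scale_ratio) are declared here too
  ]

  instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
  ]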

rmccorm4 commented 1 year ago

Hi @DequanZhu ,

> but when I posted an inference request using the Triton client, an error occurred: "Stub process is unhealthy and it will be restarted."

I believe there have been several fixes and improvements to the Python backend since r22.12 with regard to these unhealthy stub issues. Can you try the latest version, r23.05, and see if the issue persists?

dyastremsky commented 1 year ago

Closing due to inactivity. If you would like to reopen this issue for follow-up, please let us know.