triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Triton model is running at 30 FPS but triton_client.infer is returning around 15 FPS #6605

Closed FawadAbbas12 closed 12 months ago

FawadAbbas12 commented 12 months ago

Description
There is an almost 50% drop in FPS during transmission, even though the client and the server run on the same system.

Triton Information
What version of Triton are you using? 2.22.0
Are you using the Triton container or did you build it yourself? Triton container: nvcr.io/nvidia/tritonserver:22.05-py3

To Reproduce
I cannot share the complete code here (if required I can create a separate repo, as the backend code is quite big), but here is the inference part:

import numpy as np
import cv2
import tritonclient.grpc as grpcclient
import sys
import argparse

def runtime_monitor(some_function):
    from time import time

    def wrapper(*args, **kwargs):
        t1 = time()
        result = some_function(*args, **kwargs)
        end = time()-t1
        print(f'{some_function.__name__} Time : {1/end}')
        return result
    return wrapper

class Infer_Engine():
    def __init__(self) -> None:
        self.inputs = []

    def setup_inputs(self, frame):
        if len(self.inputs):return
        self.inputs.append(grpcclient.InferInput('action_type', [1], "INT32"))
        self.inputs.append(grpcclient.InferInput('image', frame.shape, "INT32"))
        self.inputs.append(grpcclient.InferInput('inti_bbox', [4], "INT32"))
    def get_triton_client(self, url: str = 'localhost:8001'):
        try:
            keepalive_options = grpcclient.KeepAliveOptions(
                keepalive_time_ms=2**31 - 1,
                keepalive_timeout_ms=20000,
                keepalive_permit_without_calls=False,
                http2_max_pings_without_data=2
            )
            triton_client = grpcclient.InferenceServerClient(
                url=url,
                verbose=False,
                keepalive_options=keepalive_options)
        except Exception as e:
            print("channel creation failed: " + str(e))
            sys.exit()
        return triton_client

    def draw_bounding_box(self, img, class_id, confidence, x, y, x_plus_w, y_plus_h):
        label = f'({class_id}: {confidence:.2f})'
        color = (255, 0, 0)
        cv2.rectangle(img, (x, y), (x_plus_w, y_plus_h), color, 2)
        cv2.putText(img, label, (x - 10, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

    def init_model(self, frame: np.ndarray, bbox: np.ndarray,
                    triton_client: grpcclient.InferenceServerClient):
        outputs = []
        self.setup_inputs(frame)
        # Initialize the data
        self.inputs[0].set_data_from_numpy(np.array([0], dtype=np.int32))
        self.inputs[1].set_data_from_numpy(frame)
        self.inputs[2].set_data_from_numpy(bbox)

        # Test with outputs
        results = triton_client.infer(model_name='mixformer_conv_mae',
                                    inputs=self.inputs,
                                    outputs=outputs)
        return 

    @runtime_monitor
    def run_inference(self, frame: np.ndarray,
                    triton_client: grpcclient.InferenceServerClient):
        outputs = []
        self.setup_inputs(frame)

        # Initialize the data
        self.inputs[0].set_data_from_numpy(np.array([1], dtype=np.int32))
        self.inputs[1].set_data_from_numpy(frame)
        self.inputs[2].set_data_from_numpy(np.array([0,0,0,0], dtype=np.int32))

        outputs.append(grpcclient.InferRequestedOutput('target_bbox'))
        outputs.append(grpcclient.InferRequestedOutput('score'))

        # Test with outputs
        results = triton_client.infer(model_name='mixformer_conv_mae',
                                    inputs=self.inputs,
                                    outputs=outputs)
        target_bbox = results.as_numpy('target_bbox')
        score = results.as_numpy('score')

        return target_bbox, score

    def start(self, url):
        import time
        name = 'oc1'
        vid = cv2.VideoCapture(f'{name}.mp4')
        def read_frame():
            ret, frame = vid.read()    
            try:
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            except:pass
            finally:
                return frame, ret
        triton_client = self.get_triton_client(url)
        frame, ret = read_frame()
        import datetime
        if ret:
            frame_disp = frame.copy()
            cv2.putText(frame_disp, 'Select target ROI and press ENTER', (20, 30), cv2.FONT_HERSHEY_COMPLEX_SMALL, 1.5,
                        (0, 0, 0), 1)
            # x, y, w, h = cv2.selectROI('display_name', cv2.cvtColor(frame_disp, cv2.COLOR_RGB2BGR), fromCenter=False)
            # cv2.destroyAllWindows()
            # init_state = [x, y, w, h]
            init_state = np.array([852, 811, 56, 96], dtype=np.int32)
            print(init_state)
            init_info = {'init_bbox':init_state}
            self.init_model(frame.copy().astype(np.int32), init_state, triton_client)
            h, w = frame.shape[:2]
            frame_size = (w, h)  # VideoWriter expects (width, height)
            name_id = datetime.datetime.now().__str__()
            writer= cv2.VideoWriter(
                f'{name}-{name_id}.mp4',
                cv2.VideoWriter_fourcc(*'MP4V'),
                10, 
                frame_size
            )
            t = time.time()
            idx = 0
            while ret:
                idx += 1
                target_bbox, score = self.run_inference(frame.copy().astype(np.int32), triton_client)
                r = list(map(int, target_bbox))
                writer.write(
                    cv2.rectangle(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR), (r[0], r[1]), (r[0]+r[2], r[1]+r[3]), (255,255,255),4)
                )
                frame, ret = read_frame()
            e = time.time()
            print(f'total Time = {e-t} sec')
            print(f'FPS        = {idx/(e-t)}')
            vid.release()
            writer.release()

if __name__ == '__main__':
    ie = Infer_Engine()
    ie.start('localhost:8001')

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

name: "mixformer_conv_mae"
backend: "python"
max_batch_size: 0
input [
  {
    name: "action_type"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "inti_bbox"
    data_type: TYPE_INT32
    dims: [ 4 ]
  },
  {
    name: "image"
    data_type: TYPE_INT32
    dims: [ -1, -1, -1 ]
  }
]

output [
  {
    name: "target_bbox"
    data_type: TYPE_FP32
    dims: [ 1, 4 ]
  },
  {
    name: "score"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]

instance_group [{ kind: KIND_GPU }]

Expected behavior
I have also added a runtime monitor wrapper around the TritonPythonModel class, and it shows that the model completes inference at 30 FPS, but on the receiver (client) side the measured performance is only around 15 FPS. I expected the two rates to match.
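For context, this is roughly where that wrapper sits on the server side. It is an illustrative skeleton only, not the actual model: the tracker logic inside execute() is omitted and the outputs are placeholders; the tensor names come from the config above.

import numpy as np
import triton_python_backend_utils as pb_utils
from time import time

def runtime_monitor(some_function):
    # Same wrapper as in the client code: prints 1/elapsed, i.e. a rate, not a time.
    def wrapper(*args, **kwargs):
        t1 = time()
        result = some_function(*args, **kwargs)
        end = time() - t1
        print(f'{some_function.__name__} Time : {1/end}')
        return result
    return wrapper

class TritonPythonModel:
    @runtime_monitor
    def execute(self, requests):
        responses = []
        for request in requests:
            action = pb_utils.get_input_tensor_by_name(request, "action_type").as_numpy()
            image = pb_utils.get_input_tensor_by_name(request, "image").as_numpy()
            bbox = pb_utils.get_input_tensor_by_name(request, "inti_bbox").as_numpy()

            # ... init (action == 0) or tracking (action == 1) would run here ...
            target_bbox = np.zeros((1, 4), dtype=np.float32)  # placeholder only
            score = np.zeros((1,), dtype=np.float32)          # placeholder only

            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("target_bbox", target_bbox),
                pb_utils.Tensor("score", score),
            ]))
        return responses

Note that this wrapper only times execute() itself.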

kthui commented 12 months ago

Hi @FawadAbbas12, assuming you used the same

def runtime_monitor(some_function):
    from time import time

    def wrapper(*args, **kwargs):
        t1 = time()
        result = some_function(*args, **kwargs)
        end = time()-t1
        print(f'{some_function.__name__} Time : {1/end}')
        return result
    return wrapper

on the model to record the execution duration of the execute() function: the duration on the server side only includes the execution time, while the duration on the client side includes both the execution time and the data transmission time, so this is not an apples-to-apples comparison. We recommend using the Triton Performance Analyzer to measure throughput, so that this kind of mismatched comparison can be avoided.

FawadAbbas12 commented 12 months ago

Sorry for not mentioning it, but I have also used the perf analyzer against the gRPC endpoint and the results are the same: it says the model can support 50 inferences per second, whereas the client gets 25.

kthui commented 12 months ago

model can support 50 inference per sec whereas it return 25

I assume you mean the perf analyzer benchmarked the throughput at 50 infer/sec, while your client only achieved 25 infer/sec?

I think the issue is here

results = triton_client.infer(model_name='mixformer_conv_mae',
                              inputs=self.inputs,
                              outputs=outputs)

where the next inference waits until the previous one has completed and returned before it starts, so there is a gRPC communication gap between inferences. One way to solve this is to use async_infer() instead of infer(), which allows inferences to overlap on the client. You can read more about async_infer() here: https://github.com/triton-inference-server/client/blob/main/src/python/library/tritonclient/grpc/_client.py#L1567
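For example, a minimal sketch of what overlapping with async_infer() could look like for this client; the model name, tensor names, and dtypes are taken from the code above, and the queue-based response handling is just one possible way to consume the results:

import queue
import numpy as np
import tritonclient.grpc as grpcclient

results_queue = queue.Queue()

def on_complete(result, error):
    # Invoked by the client library when the response (or an error) arrives.
    if error is not None:
        results_queue.put(error)
    else:
        results_queue.put((result.as_numpy('target_bbox'), result.as_numpy('score')))

triton_client = grpcclient.InferenceServerClient(url='localhost:8001')

def send_frame(frame: np.ndarray):
    frame = frame.astype(np.int32)  # model expects TYPE_INT32, as in the code above
    inputs = [
        grpcclient.InferInput('action_type', [1], "INT32"),
        grpcclient.InferInput('image', list(frame.shape), "INT32"),
        grpcclient.InferInput('inti_bbox', [4], "INT32"),
    ]
    inputs[0].set_data_from_numpy(np.array([1], dtype=np.int32))
    inputs[1].set_data_from_numpy(frame)
    inputs[2].set_data_from_numpy(np.array([0, 0, 0, 0], dtype=np.int32))
    outputs = [grpcclient.InferRequestedOutput('target_bbox'),
               grpcclient.InferRequestedOutput('score')]
    # Returns immediately: the next frame can be read, converted, and sent
    # while this request is still in flight, which hides the gRPC round trip.
    triton_client.async_infer(model_name='mixformer_conv_mae',
                              inputs=inputs,
                              callback=on_complete,
                              outputs=outputs)

Since this looks like a stateful tracker, keeping only a small number of requests in flight (and draining results_queue as you go) should be enough to hide the transfer gap.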

FawadAbbas12 commented 12 months ago

Thanks I will try to test it and will report back results

FawadAbbas12 commented 12 months ago

Thanks @kthui for pointing out the issue. When I use async_infer I get the same FPS :)