isarsoft / yolov4-triton-tensorrt

This repository deploys YOLOv4 as an optimized TensorRT engine to Triton Inference Server
http://www.isarsoft.com

Triton Inference Server adding 3 seconds to get YOLOv4 inference #65

Closed 0x446f6d closed 2 years ago

0x446f6d commented 2 years ago

Hello,

I just recently set up Triton to work with our YOLOv4/TensorRT model. However, I noticed that inference is taking a long time, ~3 seconds. This didn't seem right, so I fetched the verbose logs and got this back:

I0320 05:09:58.334630 1 grpc_server.cc:272] Process for ServerLive, rpc_ok=1, 3 step START
I0320 05:09:58.334692 1 grpc_server.cc:227] Ready for RPC 'ServerLive', 4
I0320 05:09:58.334773 1 grpc_server.cc:272] Process for ServerLive, rpc_ok=1, 3 step COMPLETE
I0320 05:09:58.334785 1 grpc_server.cc:413] Done for ServerLive, 3
I0320 05:09:58.405717 1 grpc_server.cc:272] Process for ServerReady, rpc_ok=1, 3 step START
I0320 05:09:58.405777 1 grpc_server.cc:227] Ready for RPC 'ServerReady', 4
I0320 05:09:58.405796 1 model_repository_manager.cc:522] ModelStates()
I0320 05:09:58.405874 1 grpc_server.cc:272] Process for ServerReady, rpc_ok=1, 3 step COMPLETE
I0320 05:09:58.405887 1 grpc_server.cc:413] Done for ServerReady, 3
I0320 05:09:58.477041 1 grpc_server.cc:272] Process for ModelReady, rpc_ok=1, 3 step START
I0320 05:09:58.477095 1 grpc_server.cc:227] Ready for RPC 'ModelReady', 4
I0320 05:09:58.477112 1 model_repository_manager.cc:590] GetModel() 'yolov4' version -1
I0320 05:09:58.477189 1 grpc_server.cc:272] Process for ModelReady, rpc_ok=1, 3 step COMPLETE
I0320 05:09:58.477202 1 grpc_server.cc:413] Done for ModelReady, 3
I0320 05:10:01.545416 1 grpc_server.cc:3420] Process for ModelInferHandler, rpc_ok=1, 6 step START
I0320 05:10:01.545473 1 grpc_server.cc:3413] New request handler for ModelInferHandler, 8
I0320 05:10:01.545488 1 model_repository_manager.cc:590] GetModel() 'yolov4' version -1
I0320 05:10:01.545503 1 model_repository_manager.cc:590] GetModel() 'yolov4' version -1
I0320 05:10:01.545589 1 infer_request.cc:675] prepared: [0x0x7fdef4006340] request id: , model: yolov4, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
original inputs:
[0x0x7fdee8007f38] input: input, type: FP32, original shape: [1,3,416,416], batch + shape: [1,3,416,416], shape: [1,3,416,416]
override inputs:
inputs:
[0x0x7fdee8007f38] input: input, type: FP32, original shape: [1,3,416,416], batch + shape: [1,3,416,416], shape: [1,3,416,416]
original requested outputs:
boxes
requested outputs:
boxes
I0320 05:10:01.545749 1 tensorrt.cc:5376] model yolov4, instance yolov4, executing 1 requests
I0320 05:10:01.545785 1 tensorrt.cc:1609] TRITONBACKEND_ModelExecute: Issuing yolov4 with 1 requests
I0320 05:10:01.545796 1 tensorrt.cc:1668] TRITONBACKEND_ModelExecute: Running yolov4 with 1 requests
I0320 05:10:01.545842 1 tensorrt.cc:2804] Optimization profile default [0] is selected for yolov4
I0320 05:10:01.545896 1 pinned_memory_manager.cc:161] pinned memory allocation: size 2076672, addr 0x7fdff8000090
I0320 05:10:01.546688 1 tensorrt.cc:2181] Context with profile default [0] is being executed for yolov4
I0320 05:10:01.549341 1 infer_response.cc:166] add response output: output: boxes, type: FP32, shape: [1,10647,1,4]
I0320 05:10:01.549391 1 grpc_server.cc:2498] GRPC: unable to provide 'boxes' in GPU, will use CPU
I0320 05:10:01.549439 1 grpc_server.cc:2509] GRPC: using buffer for 'boxes', size: 170352, addr: 0x7fdf5800e960
I0320 05:10:01.549454 1 pinned_memory_manager.cc:161] pinned memory allocation: size 170352, addr 0x7fdff81fb0a0
I0320 05:10:01.558050 1 grpc_server.cc:3572] ModelInferHandler::InferResponseComplete, 6 step ISSUED
I0320 05:10:01.558085 1 grpc_server.cc:2591] GRPC free: size 170352, addr 0x7fdf5800e960
I0320 05:10:01.558338 1 grpc_server.cc:3148] ModelInferHandler::InferRequestComplete
I0320 05:10:01.558373 1 tensorrt.cc:2661] TRITONBACKEND_ModelExecute: model yolov4 released 1 requests
I0320 05:10:01.558389 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7fdff81fb0a0
I0320 05:10:01.558407 1 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7fdff8000090
I0320 05:10:01.558675 1 grpc_server.cc:3420] Process for ModelInferHandler, rpc_ok=1, 6 step COMPLETE
I0320 05:10:01.558696 1 grpc_server.cc:2419] Done for ModelInferHandler, 6

So the actual inference only takes ~10-15 milliseconds, which seems normal. However, before that, the logs show:

I0320 05:09:58.477189 1 grpc_server.cc:272] Process for ModelReady, rpc_ok=1, 3 step COMPLETE
I0320 05:09:58.477202 1 grpc_server.cc:413] Done for ModelReady, 3
I0320 05:10:01.545416 1 grpc_server.cc:3420] Process for ModelInferHandler, rpc_ok=1, 6 step START
I0320 05:10:01.545473 1 grpc_server.cc:3413] New request handler for ModelInferHandler, 8

I tried searching the docs, Google, and other issues to figure out what exactly is going on and why this takes 3 seconds to run, but unfortunately didn't find much. At first I thought it was because of model state, but according to the docs, CNNs like YOLO should be stateless. I'm unsure how to resolve this.

This is the config.pbtxt

name: "yolov4"
platform: "tensorrt_plan"
dynamic_batching { }
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 1,  3, 416, 416 ]
  }
]
output [
  {
    name: "boxes"
    data_type: TYPE_FP32
    dims: [ 1, 10647, 1, 4 ]
  },
  {
    name: "confs"
    data_type: TYPE_FP32
    dims: [ 1, 10647, 18 ]
  }
]

and this is how the docker container running Triton is deployed


docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/name/models:/models nvcr.io/nvidia/tritonserver:22.02-py3 tritonserver --model-repository=/models --strict-model-config=false --grpc-infer-allocation-pool-size=16 --log-verbose 1
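
For reference, the ServerLive / ServerReady / ModelReady entries at the top of the log come from the client-side health checks that run before the actual infer call, roughly like this (a minimal sketch with tritonclient.grpc; the URL is just an example):

import tritonclient.grpc as grpcclient

# Port 8001 is the gRPC port mapped in the docker command above.
triton_client = grpcclient.InferenceServerClient(url="127.0.0.1:8001", verbose=False)

print(triton_client.is_server_live())            # appears as 'ServerLive' in the server log
print(triton_client.is_server_ready())           # 'ServerReady'
print(triton_client.is_model_ready("yolov4"))    # 'ModelReady'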
philipp-schmidt commented 2 years ago

Hi, please verify your GPU is actually running this by looking at nvidia-smi. Also try Triton's own performance analysis tool, which is bundled with their SDK and packaged in the Docker images, and test your endpoint with it to make sure the issue isn't in your client-side code.

0x446f6d commented 2 years ago

Hi

I checked nvidia-smi and it does indeed load the model into GPU memory. However, during inference, memory usage does not actually increase, which I assume it should. So maybe on an inference request it falls back to the CPU, which would explain why it takes much longer.

However, when running the model, the following output is produced

{
    "name": "yolov4",
    "platform": "tensorrt_plan",
    "backend": "tensorrt",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 0,
    "input": [
        {
            "name": "input",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1,
                3,
                416,
                416
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "boxes",
            "data_type": "TYPE_FP32",
            "dims": [
                1,
                10647,
                1,
                4
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "confs",
            "data_type": "TYPE_FP32",
            "dims": [
                1,
                10647,
                18
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "dynamic_batching": {
        "preferred_batch_size": [],
        "max_queue_delay_microseconds": 0,
        "preserve_ordering": false,
        "priority_levels": 0,
        "default_priority_level": 0,
        "priority_queue_policy": {}
    },
    "instance_group": [
        {
            "name": "yolov4",
            "kind": "KIND_GPU",
            "count": 1,
            "gpus": [
                0
            ],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.plan",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {},
    "model_warmup": []
}
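
(For completeness, this configuration can also be fetched programmatically; a minimal sketch using the gRPC client, which is not necessarily how the dump above was produced:)

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="127.0.0.1:8001")
# as_json=True returns the response as a dict; over gRPC the model config
# itself typically sits under a "config" key in that dict.
response = client.get_model_config("yolov4", as_json=True)
print(response)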

Additionally, according to the logs from my initial post, the following occurs

I0320 05:10:01.545749 1 tensorrt.cc:5376] model yolov4, instance yolov4, executing 1 requests
I0320 05:10:01.545785 1 tensorrt.cc:1609] TRITONBACKEND_ModelExecute: Issuing yolov4 with 1 requests
I0320 05:10:01.545796 1 tensorrt.cc:1668] TRITONBACKEND_ModelExecute: Running yolov4 with 1 requests
I0320 05:10:01.545842 1 tensorrt.cc:2804] Optimization profile default [0] is selected for yolov4
I0320 05:10:01.545896 1 pinned_memory_manager.cc:161] pinned memory allocation: size 2076672, addr 0x7fdff8000090
I0320 05:10:01.546688 1 tensorrt.cc:2181] Context with profile default [0] is being executed for yolov4
I0320 05:10:01.549341 1 infer_response.cc:166] add response output: output: boxes, type: FP32, shape: [1,10647,1,4]
I0320 05:10:01.549391 1 grpc_server.cc:2498] GRPC: unable to provide 'boxes' in GPU, will use CPU
I0320 05:10:01.549439 1 grpc_server.cc:2509] GRPC: using buffer for 'boxes', size: 170352, addr: 0x7fdf5800e960
I0320 05:10:01.549454 1 pinned_memory_manager.cc:161] pinned memory allocation: size 170352, addr 0x7fdff81fb0a0
I0320 05:10:01.558050 1 grpc_server.cc:3572] ModelInferHandler::InferResponseComplete, 6 step ISSUED
I0320 05:10:01.558085 1 grpc_server.cc:2591] GRPC free: size 170352, addr 0x7fdf5800e960
I0320 05:10:01.558338 1 grpc_server.cc:3148] ModelInferHandler::InferRequestComplete
I0320 05:10:01.558373 1 tensorrt.cc:2661] TRITONBACKEND_ModelExecute: model yolov4 released 1 requests

So it seems the actual inference request only takes ~10-12 milliseconds, but the total round-trip time is still 3 seconds.

This is the client script I am using to send the infer requests

import sys
import time

import cv2
import numpy as np
import tritonclient.grpc as grpcclient

# FLAGS (CLI arguments), preprocess() and postprocess() come from the rest of the client script.

def main():
    triton_client = grpcclient.InferenceServerClient(
            url=FLAGS.url,
            verbose=FLAGS.verbose)

    inputs = []
    outputs = []
    inputs.append(grpcclient.InferInput('input', [1, 3, FLAGS.width, FLAGS.height], "FP32"))
    outputs.append(grpcclient.InferRequestedOutput('boxes'))

    print("Creating buffer from image file...")
    input_image = cv2.imread(str(FLAGS.input))
    if input_image is None:
        print(f"FAILED: could not load input image {str(FLAGS.input)}")
        sys.exit(1)

    input_image_buffer = preprocess(input_image, [FLAGS.width, FLAGS.height])
    input_image_buffer = np.expand_dims(input_image_buffer, axis=0)

    inputs[0].set_data_from_numpy(input_image_buffer)

    print("Invoking inference...")

    print(int(time.time()))
    results = triton_client.infer(model_name=FLAGS.model,
                                  inputs=inputs,
                                  outputs=outputs)
    result = results.as_numpy('boxes')
    # print(f"Received result buffer of size {result.shape}")
    # print(f"Naive buffer sum: {np.sum(result)}")

    detected_objects = postprocess(result, input_image.shape[1], input_image.shape[0], [FLAGS.width, FLAGS.height], FLAGS.confidence, FLAGS.nms)
    #print(f"Detected objects: {len(detected_objects)}")

    print(int(time.time()))

I ran the performance analysis tool and these are the results

[Screenshot: performance analysis results]
philipp-schmidt commented 2 years ago

According to the perf_client everything is working as expected. I don't see where you measure 3 seconds round trip.

0x446f6d commented 2 years ago

Please check the very first post in this issue; I pasted the logs in it.

Below is the output from the client.py script I sent above. As you can see, it prints the current time (epoch seconds) right before it sends the request to the server for inference, and prints the current time again after it has received the result. The difference is 3 seconds.

In the logs I sent in the first post, the very first line prints out

I0320 05:09:58.334630 1 grpc_server.cc:272] Process for ServerLive, rpc_ok=1, 3 step START

The very last line prints out

I0320 05:10:01.558696 1 grpc_server.cc:2419] Done for ModelInferHandler, 6

As you can see, it started at 05:09:58 and finished at 05:10:01, a 3-second difference.

I don't believe the performance client gives an accurate measurement, because I don't think it is sending an actual image. The command I used to run the perf client is below:

perf_client -m yolov4 -u 127.0.0.1:8001 -i grpc --shared-memory system --concurrency-range 4
[Screenshot: terminal output]
philipp-schmidt commented 2 years ago

The perf_client does send an actual image - you can even specify the image if you want. The problem is definitely in the client code. Please try the client.py from this repo and see if there is a difference. I'm unsure about the Triton logs and what the timings mean in more detail.
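
To narrow it down, you could also time the individual client steps instead of printing whole-second epoch timestamps. A rough sketch, reusing FLAGS, preprocess and postprocess from your client script above:

import time

import cv2
import numpy as np
import tritonclient.grpc as grpcclient

t0 = time.perf_counter()
triton_client = grpcclient.InferenceServerClient(url=FLAGS.url)
t1 = time.perf_counter()

# Image loading, preprocessing and copying the tensor into the request
input_image = cv2.imread(str(FLAGS.input))
input_image_buffer = np.expand_dims(
    preprocess(input_image, [FLAGS.width, FLAGS.height]), axis=0)
inputs = [grpcclient.InferInput('input', [1, 3, FLAGS.width, FLAGS.height], "FP32")]
inputs[0].set_data_from_numpy(input_image_buffer)
outputs = [grpcclient.InferRequestedOutput('boxes')]
t2 = time.perf_counter()

# Network round trip including server-side inference
results = triton_client.infer(model_name=FLAGS.model, inputs=inputs, outputs=outputs)
t3 = time.perf_counter()

detected_objects = postprocess(results.as_numpy('boxes'),
                               input_image.shape[1], input_image.shape[0],
                               [FLAGS.width, FLAGS.height],
                               FLAGS.confidence, FLAGS.nms)
t4 = time.perf_counter()

print(f"client setup:      {t1 - t0:.3f} s")
print(f"load + preprocess: {t2 - t1:.3f} s")
print(f"infer round trip:  {t3 - t2:.3f} s")
print(f"postprocess:       {t4 - t3:.3f} s")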

0x446f6d commented 2 years ago

It seems it was an error in the client - what it was exactly, I am unsure of.

However, another problem seems to have come up: the detections from the model don't seem correct. The client seems to be looking for "detections", but Triton only allows "boxes" and "confs" as outputs now. I replaced "detections" in the client.py script with "boxes" and kept everything else the same, but when I did that the detections were all messed up (see image below).

We tested this exact image with our model before Triton, using just TensorRT, and it detected everything as expected, so it certainly cannot be a problem with the model.

[Image: detection output with incorrect bounding boxes]

philipp-schmidt commented 2 years ago

No, the output of the newest version of our networks is called "detections" and consists of just one array. Please check again.
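
For the model you currently have deployed, which exposes separate boxes and confs outputs, both outputs would have to be requested and merged before NMS. A rough sketch based on the shapes in your config (the exact box and score layout depends on how your engine was exported, so treat this as illustrative):

import numpy as np
import tritonclient.grpc as grpcclient

# Reuses inputs and triton_client from the client script earlier in this thread.
outputs = [
    grpcclient.InferRequestedOutput('boxes'),  # shape (1, 10647, 1, 4) per your config
    grpcclient.InferRequestedOutput('confs'),  # shape (1, 10647, 18)
]
results = triton_client.infer(model_name='yolov4', inputs=inputs, outputs=outputs)

boxes = results.as_numpy('boxes').reshape(-1, 4)   # one box per candidate
confs = results.as_numpy('confs').reshape(-1, 18)  # per-class scores per candidate

class_ids = np.argmax(confs, axis=1)
scores = confs[np.arange(len(confs)), class_ids]
keep = scores > 0.5  # confidence threshold; NMS still has to run on what is left
candidates = np.concatenate(
    [boxes[keep], scores[keep][:, None], class_ids[keep][:, None].astype(np.float32)],
    axis=1)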