isarsoft / yolov4-triton-tensorrt

This repository deploys YOLOv4 as an optimized TensorRT engine to Triton Inference Server
http://www.isarsoft.com

multiple model instances issue #52

Closed ontheway16 closed 2 years ago

ontheway16 commented 2 years ago

Hi,

After dealing with Kubernetes-based solutions for a long time, I have switched to Celery. Currently I am able to feed the tritonserver (20.08) well enough, despite some CPU bottleneck. But GPU utilization is still on the low side, ranging between 20-60%.

I was already setting the config to 8 instances, but decided to give dynamic_batching a try and set the config to the following:

name: "yolov4"
platform: "tensorrt_plan"
max_batch_size: 64
input [
  {
    name: "data"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 608, 608 ]
  }
]
output [
  {
    name: "prob"
    data_type: TYPE_FP32
    dims: [7001, 1, 1]
  }
]
instance_group [
  {
    count: 8
    kind: KIND_GPU
  }
]
dynamic_batching {
  preferred_batch_size: [1,2,4,8,16,32,64]
  max_queue_delay_microseconds: 100
}

It failed to function, so I took a look at the Triton logs and found this:

tritonserver_1    | I1012 18:15:51.073224 1 logging.cc:52] Deserialize required 2748834 microseconds.
tritonserver_1    | I1012 18:15:51.116093 1 autofill.cc:225] TensorRT autofill: OK: 
tritonserver_1    | I1012 18:15:51.116163 1 model_config_utils.cc:629] autofilled config: name: "yolov4"
tritonserver_1    | platform: "tensorrt_plan"
tritonserver_1    | max_batch_size: 1
tritonserver_1    | input {
tritonserver_1    |   name: "data"
tritonserver_1    |   data_type: TYPE_FP32
tritonserver_1    |   format: FORMAT_NCHW
tritonserver_1    |   dims: 3
tritonserver_1    |   dims: 608
tritonserver_1    |   dims: 608
tritonserver_1    | }
tritonserver_1    | output {
tritonserver_1    |   name: "prob"
tritonserver_1    |   data_type: TYPE_FP32
tritonserver_1    |   dims: 7001
tritonserver_1    |   dims: 1
tritonserver_1    |   dims: 1
tritonserver_1    | }
tritonserver_1    | instance_group {
tritonserver_1    |   count: 2
tritonserver_1    |   kind: KIND_GPU
tritonserver_1    | }
tritonserver_1    | default_model_filename: "model.plan"
tritonserver_1    | 
tritonserver_1    | I1012 18:15:51.116615 1 model_repository_manager.cc:618] AsyncLoad() 'resnet50_pytorch'
tritonserver_1    | I1012 18:15:51.116639 1 model_repository_manager.cc:680] TriggerNextAction() 'resnet50_pytorch' version 1: 1
tritonserver_1    | I1012 18:15:51.116652 1 model_repository_manager.cc:718] Load() 'resnet50_pytorch' version 1
tritonserver_1    | I1012 18:15:51.116659 1 model_repository_manager.cc:737] loading: resnet50_pytorch:1
tritonserver_1    | I1012 18:15:51.116760 1 model_repository_manager.cc:618] AsyncLoad() 'yolov4'
tritonserver_1    | I1012 18:15:51.116778 1 model_repository_manager.cc:680] TriggerNextAction() 'yolov4' version 1: 1
tritonserver_1    | I1012 18:15:51.116787 1 model_repository_manager.cc:718] Load() 'yolov4' version 1
tritonserver_1    | I1012 18:15:51.116794 1 model_repository_manager.cc:737] loading: yolov4:1
tritonserver_1    | I1012 18:15:51.116766 1 model_repository_manager.cc:790] CreateInferenceBackend() 'resnet50_pytorch' version 1
tritonserver_1    | I1012 18:15:51.116867 1 model_repository_manager.cc:790] CreateInferenceBackend() 'yolov4' version 1
tritonserver_1    | I1012 18:15:51.184996 1 libtorch_backend.cc:220] Creating instance resnet50_pytorch_0_0_gpu0 on GPU 0 (6.1) using model.pt
celery_default_1  | [2021-10-12 18:15:51,281: DEBUG/MainProcess] pidbox received method enable_events() [reply_to:None ticket:None]
celery_cpu_1      | [2021-10-12 18:15:51,282: DEBUG/MainProcess] pidbox received method enable_events() [reply_to:None ticket:None]
tritonserver_1    | W1012 18:15:52.596620 1 logging.cc:46] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
tritonserver_1    | I1012 18:15:52.610466 1 dynamic_batch_scheduler.cc:216] Starting dynamic-batch scheduler thread 0 at nice 5...
tritonserver_1    | I1012 18:15:52.616167 1 model_repository_manager.cc:925] successfully loaded 'resnet50_pytorch' version 1
tritonserver_1    | I1012 18:15:52.616201 1 model_repository_manager.cc:680] TriggerNextAction() 'resnet50_pytorch' version 1: 0
tritonserver_1    | I1012 18:15:52.616208 1 model_repository_manager.cc:695] no next action, trigger OnComplete()
tritonserver_1    | I1012 18:15:53.736514 1 logging.cc:52] Deserialize required 2121911 microseconds.
tritonserver_1    | I1012 18:15:53.736539 1 plan_backend.cc:331] Creating instance yolov4_0_0_gpu0 on GPU 0 (6.1) using model.plan
tritonserver_1    | I1012 18:15:53.739256 1 plan_backend.cc:503] Detected data as execution binding for yolov4
tritonserver_1    | I1012 18:15:53.739281 1 plan_backend.cc:503] Detected prob as execution binding for yolov4
tritonserver_1    | I1012 18:15:53.739459 1 plan_backend.cc:649] Created instance yolov4_0_0_gpu0 on GPU 0 with stream priority 0
tritonserver_1    | I1012 18:15:53.739595 1 plan_backend.cc:331] Creating instance yolov4_0_1_gpu0 on GPU 0 (6.1) using model.plan
tritonserver_1    | I1012 18:15:53.742664 1 plan_backend.cc:503] Detected data as execution binding for yolov4
tritonserver_1    | I1012 18:15:53.742688 1 plan_backend.cc:503] Detected prob as execution binding for yolov4
tritonserver_1    | I1012 18:15:53.742868 1 plan_backend.cc:649] Created instance yolov4_0_1_gpu0 on GPU 0 with stream priority 0
tritonserver_1    | I1012 18:15:53.742954 1 dynamic_batch_scheduler.cc:216] Starting dynamic-batch scheduler thread 0 at nice 5...
tritonserver_1    | I1012 18:15:53.743034 1 plan_backend.cc:356] plan backend for yolov4
tritonserver_1    | name=yolov4
tritonserver_1    | contexts:
tritonserver_1    |   name=yolov4_0_0_gpu0, gpu=0, max_batch_size=1
tritonserver_1    |   bindings:
tritonserver_1    |     0: max possible byte_size=4435968, buffer=0x7fd194800000 ]
tritonserver_1    |     1: max possible byte_size=28004, buffer=0x7fd26f647600 ]
tritonserver_1    |   name=yolov4_0_1_gpu0, gpu=0, max_batch_size=1
tritonserver_1    |   bindings:
tritonserver_1    |     0: max possible byte_size=4435968, buffer=0x7fd176800000 ]
tritonserver_1    |     1: max possible byte_size=28004, buffer=0x7fd26f694e00 ]
tritonserver_1    | 
tritonserver_1    | I1012 18:15:53.761424 1 model_repository_manager.cc:925] successfully loaded 'yolov4' version 1

Apparently, my 8 instances are not in place either. Tritonserver allocates only 2.6 GB of GPU memory, out of 11.

Is there anything I can do to fix the model instances and dynamic_batching? I believe that if I can run more than two instances, I will not need the dynamic batching at all.

Edit: the quoted config values above were wrong; fixed.

philipp-schmidt commented 2 years ago

According to your log everything is working fine. Please provide more details.

philipp-schmidt commented 2 years ago

I never had to touch multiple instances; one should be sufficient. Preferred batch size should be actual "batched" sizes. Try [4,16].
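As a rough sketch of that suggestion against the config above (single instance; the 100 µs queue delay is simply carried over from the original config, not a tuned value):

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 16 ]
  max_queue_delay_microseconds: 100
}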

philipp-schmidt commented 2 years ago

It is also very hard to CPU-bottleneck tritonserver. It is probably your client script that is the bottleneck.

philipp-schmidt commented 2 years ago

Please provide the output of perf_client tests; look it up in the docs.
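For example, something along these lines (a sketch; the model name and gRPC port are assumed from the setup above, and the flag details are in the Triton 20.08 perf_client documentation):

perf_client -m yolov4 -u localhost:8001 -i grpc -b 1 --concurrency-range 1:8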

ontheway16 commented 2 years ago

According to your log everything is working fine. Please provide more details.

Hi Philipp,

The log clearly indicates there are 2 instances, despite the setting of 8 in the config. That was my point. If the log says 2 instances, doesn't that mean there are 2 instances for certain?

tritonserver_1    | I1012 18:15:53.739459 1 plan_backend.cc:649] Created instance yolov4_0_0_gpu0 on GPU 0 with stream priority 0
...
...
...
tritonserver_1    | I1012 18:15:53.742868 1 plan_backend.cc:649] Created instance yolov4_0_1_gpu0 on GPU 0 with stream priority 0

ontheway16 commented 2 years ago

It is also very hard to CPU-bottleneck tritonserver. It is probably your client script that is the bottleneck.

I think the NMS code in the client (processing.py) is causing this CPU bottleneck. Any hope for GPU-based NMS?

[Screenshot from 2021-10-12 18-02-17]

philipp-schmidt commented 2 years ago

Your triton log says:

tritonserver_1    | instance_group {
tritonserver_1    |   count: 2
tritonserver_1    |   kind: KIND_GPU
tritonserver_1    | }

Check that you're loading the right config.

ontheway16 commented 2 years ago

It appears you are right; the dockerized Celery setup was using a copy of the model directory somewhere else... Sorry for the inconvenience.

I tried dynamic batching, since there were no performance gains even with 12 instances, but it says:

tritonserver_1 | E1012 21:33:08.045394 1 model_repository_manager.cc:1633] unable to autofill for 'yolov4', configuration specified max-batch 4 but TensorRT engine only supports max-batch 1

Is this coming from a setting in main.cpp? Or can/should the yolov4 model not be batched?

philipp-schmidt commented 2 years ago

https://github.com/isarsoft/yolov4-triton-tensorrt/blob/5bdc4900e6f9c48a2e9c29c8bee78b2098d4ba69/main.cpp#L19

Set this to at least 16 and make sure you only have one instance, or you will run out of VRAM. Triton suggests using dynamic batching over more instances in the docs.
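The engine's build-time batch size and the model config have to agree, since Triton rejects a config max_batch_size larger than what the engine supports (as in the error above). A sketch of the relevant config.pbtxt lines, assuming the engine is rebuilt for a maximum batch of 16:

max_batch_size: 16
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]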

philipp-schmidt commented 2 years ago

[4,8,16] can work better. You can test performance very accurately with perf_client. It would be appreciated if you could do that and report the speedup you get, so others can see the benefit of dynamic batching. The tutorial is in the README of this repo.

ontheway16 commented 2 years ago

Thanks for pointing this out. First I will switch back from the custom HTTP client to the original gRPC one and see if there are performance gains. I still think that maxing out the CPUs is probably coming from the NMS code..? What do you think?

philipp-schmidt commented 2 years ago

High CPU load is due to preprocessing, I think. It resizes your images, so this will already depend on the resolution of the images you use (resizing Full HD can take a CPU a few milliseconds). If you want to lower CPU usage, you probably need to optimize the preprocessing method.

https://github.com/isarsoft/yolov4-triton-tensorrt/blob/5bdc4900e6f9c48a2e9c29c8bee78b2098d4ba69/clients/python/processing.py#L6-L35

philipp-schmidt commented 2 years ago

You can try to benchmark this particular method to see how much of it your CPU can handle and then start to optimize. Check postprocessing (NMS etc.) as well, but I don't think that's the bottleneck.
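A minimal sketch of such a micro-benchmark in Python (the preprocess() here is a simplified stand-in for the one in clients/python/processing.py, whose exact signature may differ; resolution and iteration count are arbitrary):

import time
import cv2
import numpy as np

# Simplified stand-in for the client's preprocessing: resize to the network
# input size and convert HWC uint8 -> CHW float32 in [0, 1].
def preprocess(img, input_size=(608, 608)):
    resized = cv2.resize(img, input_size, interpolation=cv2.INTER_LINEAR)
    return resized.transpose((2, 0, 1)).astype(np.float32) / 255.0

# Synthetic Full HD frame to approximate the real input.
img = np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8)

n = 100
start = time.perf_counter()
for _ in range(n):
    preprocess(img)
elapsed_ms = (time.perf_counter() - start) / n * 1000
print(f"avg preprocessing time: {elapsed_ms:.2f} ms per frame")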

ontheway16 commented 2 years ago

After some tests, it appears that the Celery operations are creating the visible CPU load. I tried with a normal (non-Celery) Triton setup, with multiple files in a loop; all processes (pre-processing, inference, and post-processing) consumed about 10%, and Triton itself was around 1.4%. Under Celery, this was around 25-35%. I also saw that there is no difference between HTTP and gRPC.

philipp-schmidt commented 2 years ago

Reopen if you still have any issues with this.