isarsoft / yolov4-triton-tensorrt

This repository deploys YOLOv4 as an optimized TensorRT engine to Triton Inference Server
http://www.isarsoft.com

error: creating server: Internal - failed to load all models - NVIDIA Triton Server for YOLOv4 #67

Closed · look4pritam closed this issue 2 years ago

look4pritam commented 2 years ago

I have created the YOLO plugin, and it is working fine, as the following log shows.

RUNNING TensorRT.trtexec [TensorRT v8203] # /usr/src/tensorrt/bin/trtexec --loadEngine=/src/WORK-SPACE/yolov4-triton-tensorrt/yolov4.engine --plugins=/src/WORK-SPACE/yolov4-triton-tensorrt/build/libyoloplugin.so
[05/26/2022-07:48:15] [I] === Model Options ===
[05/26/2022-07:48:15] [I] Format: *
[05/26/2022-07:48:15] [I] Model: 
[05/26/2022-07:48:15] [I] Output:
[05/26/2022-07:48:15] [I] === Build Options ===
[05/26/2022-07:48:15] [I] Max batch: 1
[05/26/2022-07:48:15] [I] Workspace: 16 MiB
[05/26/2022-07:48:15] [I] minTiming: 1
[05/26/2022-07:48:15] [I] avgTiming: 8
[05/26/2022-07:48:15] [I] Precision: FP32
[05/26/2022-07:48:15] [I] Calibration: 
[05/26/2022-07:48:15] [I] Refit: Disabled
[05/26/2022-07:48:15] [I] Sparsity: Disabled
[05/26/2022-07:48:15] [I] Safe mode: Disabled
[05/26/2022-07:48:15] [I] DirectIO mode: Disabled
[05/26/2022-07:48:15] [I] Restricted mode: Disabled
[05/26/2022-07:48:15] [I] Save engine: 
[05/26/2022-07:48:15] [I] Load engine: /src/WORK-SPACE/yolov4-triton-tensorrt/yolov4.engine
[05/26/2022-07:48:15] [I] Profiling verbosity: 0
[05/26/2022-07:48:15] [I] Tactic sources: Using default tactic sources
[05/26/2022-07:48:15] [I] timingCacheMode: local
[05/26/2022-07:48:15] [I] timingCacheFile: 
[05/26/2022-07:48:15] [I] Input(s)s format: fp32:CHW
[05/26/2022-07:48:15] [I] Output(s)s format: fp32:CHW
[05/26/2022-07:48:15] [I] Input build shapes: model
[05/26/2022-07:48:15] [I] Input calibration shapes: model
[05/26/2022-07:48:15] [I] === System Options ===
[05/26/2022-07:48:15] [I] Device: 0
[05/26/2022-07:48:15] [I] DLACore: 
[05/26/2022-07:48:15] [I] Plugins: /src/WORK-SPACE/yolov4-triton-tensorrt/build/libyoloplugin.so
[05/26/2022-07:48:15] [I] === Inference Options ===
[05/26/2022-07:48:15] [I] Batch: 1
[05/26/2022-07:48:15] [I] Input inference shapes: model
[05/26/2022-07:48:15] [I] Iterations: 10
[05/26/2022-07:48:15] [I] Duration: 3s (+ 200ms warm up)
[05/26/2022-07:48:15] [I] Sleep time: 0ms
[05/26/2022-07:48:15] [I] Idle time: 0ms
[05/26/2022-07:48:15] [I] Streams: 1
[05/26/2022-07:48:15] [I] ExposeDMA: Disabled
[05/26/2022-07:48:15] [I] Data transfers: Enabled
[05/26/2022-07:48:15] [I] Spin-wait: Disabled
[05/26/2022-07:48:15] [I] Multithreading: Disabled
[05/26/2022-07:48:15] [I] CUDA Graph: Disabled
[05/26/2022-07:48:15] [I] Separate profiling: Disabled
[05/26/2022-07:48:15] [I] Time Deserialize: Disabled
[05/26/2022-07:48:15] [I] Time Refit: Disabled
[05/26/2022-07:48:15] [I] Skip inference: Disabled
[05/26/2022-07:48:15] [I] Inputs:
[05/26/2022-07:48:15] [I] === Reporting Options ===
[05/26/2022-07:48:15] [I] Verbose: Disabled
[05/26/2022-07:48:15] [I] Averages: 10 inferences
[05/26/2022-07:48:15] [I] Percentile: 99
[05/26/2022-07:48:15] [I] Dump refittable layers:Disabled
[05/26/2022-07:48:15] [I] Dump output: Disabled
[05/26/2022-07:48:15] [I] Profile: Disabled
[05/26/2022-07:48:15] [I] Export timing to JSON file: 
[05/26/2022-07:48:15] [I] Export output to JSON file: 
[05/26/2022-07:48:15] [I] Export profile to JSON file: 
[05/26/2022-07:48:15] [I] 
[05/26/2022-07:48:15] [I] === Device Information ===
[05/26/2022-07:48:15] [I] Selected Device: Quadro M4000
[05/26/2022-07:48:15] [I] Compute Capability: 5.2
[05/26/2022-07:48:15] [I] SMs: 13
[05/26/2022-07:48:15] [I] Compute Clock Rate: 0.7725 GHz
[05/26/2022-07:48:15] [I] Device Global Memory: 8119 MiB
[05/26/2022-07:48:15] [I] Shared Memory per SM: 96 KiB
[05/26/2022-07:48:15] [I] Memory Bus Width: 256 bits (ECC disabled)
[05/26/2022-07:48:15] [I] Memory Clock Rate: 3.005 GHz
[05/26/2022-07:48:15] [I] 
[05/26/2022-07:48:15] [I] TensorRT version: 8.2.3
[05/26/2022-07:48:15] [I] Loading supplied plugin library: /src/WORK-SPACE/yolov4-triton-tensorrt/build/libyoloplugin.so
[05/26/2022-07:48:15] [I] [TRT] [MemUsageChange] Init CUDA: CPU +175, GPU +0, now: CPU 751, GPU 167 (MiB)
[05/26/2022-07:48:15] [I] [TRT] Loaded engine size: 564 MiB
[05/26/2022-07:48:16] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +253, GPU +102, now: CPU 1019, GPU 834 (MiB)
[05/26/2022-07:48:16] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +112, GPU +44, now: CPU 1131, GPU 878 (MiB)
[05/26/2022-07:48:16] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +562, now: CPU 0, GPU 562 (MiB)
[05/26/2022-07:48:16] [I] Engine loaded in 1.14759 sec.
[05/26/2022-07:48:16] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 566, GPU 870 (MiB)
[05/26/2022-07:48:16] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 567, GPU 878 (MiB)
[05/26/2022-07:48:16] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +700, now: CPU 0, GPU 1262 (MiB)
[05/26/2022-07:48:16] [I] Using random values for input input
[05/26/2022-07:48:16] [I] Created input binding for input with dimensions 3x608x608
[05/26/2022-07:48:16] [I] Using random values for output detections
[05/26/2022-07:48:16] [I] Created output binding for detections with dimensions 159201x1x1
[05/26/2022-07:48:16] [I] Starting inference
[05/26/2022-07:48:19] [I] Warmup completed 1 queries over 200 ms
[05/26/2022-07:48:19] [I] Timing trace has 44 queries over 3.00643 s
[05/26/2022-07:48:19] [I] 
[05/26/2022-07:48:19] [I] === Trace details ===
[05/26/2022-07:48:19] [I] Trace averages of 10 runs:
[05/26/2022-07:48:19] [I] Average on 10 runs - GPU latency: 67.7814 ms - Host latency: 68.3163 ms (end to end 130.734 ms, enqueue 1.3227 ms)
[05/26/2022-07:48:19] [I] Average on 10 runs - GPU latency: 67.8393 ms - Host latency: 68.3703 ms (end to end 135.555 ms, enqueue 1.53979 ms)
[05/26/2022-07:48:19] [I] Average on 10 runs - GPU latency: 67.9055 ms - Host latency: 68.4368 ms (end to end 135.656 ms, enqueue 1.50281 ms)
[05/26/2022-07:48:19] [I] Average on 10 runs - GPU latency: 67.8812 ms - Host latency: 68.4139 ms (end to end 135.627 ms, enqueue 1.50432 ms)
[05/26/2022-07:48:19] [I] 
[05/26/2022-07:48:19] [I] === Performance summary ===
[05/26/2022-07:48:19] [I] Throughput: 14.6353 qps
[05/26/2022-07:48:19] [I] Latency: min = 68.1573 ms, max = 68.5181 ms, mean = 68.3875 ms, median = 68.4031 ms, percentile(99%) = 68.5181 ms
[05/26/2022-07:48:19] [I] End-to-End Host Latency: min = 88.3482 ms, max = 135.781 ms, mean = 134.508 ms, median = 135.582 ms, percentile(99%) = 135.781 ms
[05/26/2022-07:48:19] [I] Enqueue Time: min = 0.641052 ms, max = 1.61182 ms, mean = 1.47265 ms, median = 1.51697 ms, percentile(99%) = 1.61182 ms
[05/26/2022-07:48:19] [I] H2D Latency: min = 0.458679 ms, max = 0.488342 ms, mean = 0.463996 ms, median = 0.464172 ms, percentile(99%) = 0.488342 ms
[05/26/2022-07:48:19] [I] GPU Compute Time: min = 67.6239 ms, max = 67.991 ms, mean = 67.8552 ms, median = 67.8704 ms, percentile(99%) = 67.991 ms
[05/26/2022-07:48:19] [I] D2H Latency: min = 0.0668945 ms, max = 0.0698242 ms, mean = 0.0683836 ms, median = 0.0681152 ms, percentile(99%) = 0.0698242 ms
[05/26/2022-07:48:19] [I] Total Host Walltime: 3.00643 s
[05/26/2022-07:48:19] [I] Total GPU Compute Time: 2.98563 s
[05/26/2022-07:48:19] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/26/2022-07:48:19] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8203] # /usr/src/tensorrt/bin/trtexec --loadEngine=/src/WORK-SPACE/yolov4-triton-tensorrt/yolov4.engine --plugins=/src/WORK-SPACE/yolov4-triton-tensorrt/build/libyoloplugin.so

But I am not able to use it with NVIDIA Triton Server. The server fails with 'error: creating server: Internal - failed to load all models':
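For reference, this is a sketch of the repository layout and config.pbtxt I would expect for this engine. The input/output names and dimensions are taken from the trtexec bindings above; the paths and everything else are my assumptions, not necessarily what is on disk:

```protobuf
# Assumed model-repository layout:
#
#   /models/
#   └── yolov4/
#       ├── config.pbtxt
#       └── 1/
#           └── model.plan
#
# config.pbtxt (names and dims copied from the trtexec bindings):
name: "yolov4"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 608, 608 ]
  }
]
output [
  {
    name: "detections"
    data_type: TYPE_FP32
    dims: [ 159201, 1, 1 ]
  }
]
```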

I0526 08:10:51.582245 1 libtorch.cc:1309] TRITONBACKEND_Initialize: pytorch
I0526 08:10:51.582310 1 libtorch.cc:1319] Triton TRITONBACKEND API version: 1.8
I0526 08:10:51.582316 1 libtorch.cc:1325] 'pytorch' TRITONBACKEND API version: 1.8
2022-05-26 08:10:51.724343: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-05-26 08:10:51.766295: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0526 08:10:51.766357 1 tensorflow.cc:2176] TRITONBACKEND_Initialize: tensorflow
I0526 08:10:51.766375 1 tensorflow.cc:2186] Triton TRITONBACKEND API version: 1.8
I0526 08:10:51.766380 1 tensorflow.cc:2192] 'tensorflow' TRITONBACKEND API version: 1.8
I0526 08:10:51.766384 1 tensorflow.cc:2216] backend configuration:
{}
I0526 08:10:51.789846 1 onnxruntime.cc:2319] TRITONBACKEND_Initialize: onnxruntime
I0526 08:10:51.789862 1 onnxruntime.cc:2329] Triton TRITONBACKEND API version: 1.8
I0526 08:10:51.789868 1 onnxruntime.cc:2335] 'onnxruntime' TRITONBACKEND API version: 1.8
I0526 08:10:51.789872 1 onnxruntime.cc:2365] backend configuration:
{}
I0526 08:10:51.820249 1 openvino.cc:1207] TRITONBACKEND_Initialize: openvino
I0526 08:10:51.820265 1 openvino.cc:1217] Triton TRITONBACKEND API version: 1.8
I0526 08:10:51.820270 1 openvino.cc:1223] 'openvino' TRITONBACKEND API version: 1.8
I0526 08:10:51.941298 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x203300000' with size 268435456
I0526 08:10:51.941524 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
E0526 08:10:51.942006 1 model_repository_manager.cc:1927] Poll failed for model directory 'plugins': unexpected platform type '' for plugins
I0526 08:10:51.942871 1 model_repository_manager.cc:997] loading: yolov4:1
I0526 08:10:52.048058 1 tensorrt.cc:5231] TRITONBACKEND_Initialize: tensorrt
I0526 08:10:52.048107 1 tensorrt.cc:5241] Triton TRITONBACKEND API version: 1.8
I0526 08:10:52.048127 1 tensorrt.cc:5247] 'tensorrt' TRITONBACKEND API version: 1.8
I0526 08:10:52.048256 1 tensorrt.cc:5290] backend configuration:
{}
I0526 08:10:52.048335 1 tensorrt.cc:5342] TRITONBACKEND_ModelInitialize: yolov4 (version 1)
I0526 08:10:52.277925 1 logging.cc:49] [MemUsageChange] Init CUDA: CPU +170, GPU +0, now: CPU 231, GPU 231 (MiB)
I0526 08:10:52.902147 1 logging.cc:49] Loaded engine size: 564 MiB
I0526 08:10:53.320724 1 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +252, GPU +102, now: CPU 1628, GPU 898 (MiB)
I0526 08:10:53.461535 1 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +113, GPU +44, now: CPU 1741, GPU 942 (MiB)
I0526 08:10:53.462972 1 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +562, now: CPU 0, GPU 562 (MiB)
I0526 08:10:53.526723 1 tensorrt.cc:5391] TRITONBACKEND_ModelInstanceInitialize: yolov4 (GPU device 0)
I0526 08:10:53.527364 1 logging.cc:49] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 596, GPU 362 (MiB)
I0526 08:10:54.107624 1 logging.cc:49] Loaded engine size: 564 MiB
I0526 08:10:54.214832 1 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1741, GPU 934 (MiB)
I0526 08:10:54.215649 1 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1741, GPU 942 (MiB)
I0526 08:10:54.216969 1 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +562, now: CPU 0, GPU 562 (MiB)
I0526 08:10:54.258164 1 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 612, GPU 934 (MiB)
I0526 08:10:54.258953 1 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 612, GPU 942 (MiB)
I0526 08:10:54.301531 1 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +700, now: CPU 0, GPU 1262 (MiB)
I0526 08:10:54.301911 1 tensorrt.cc:1417] Created instance yolov4 on GPU 0 with stream priority 0
I0526 08:10:54.302107 1 model_repository_manager.cc:1152] successfully loaded 'yolov4' version 1
I0526 08:10:54.302229 1 server.cc:524] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0526 08:10:54.302372 1 server.cc:551] 
+-------------+-------------------------------------------------------------------------+--------+
| Backend     | Path                                                                    | Config |
+-------------+-------------------------------------------------------------------------+--------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so                 | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so         | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so         | {}     |
| openvino    | /opt/tritonserver/backends/openvino_2021_4/libtriton_openvino_2021_4.so | {}     |
| tensorrt    | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so               | {}     |
+-------------+-------------------------------------------------------------------------+--------+

I0526 08:10:54.302444 1 server.cc:594] 
+--------+---------+--------+
| Model  | Version | Status |
+--------+---------+--------+
| yolov4 | 1       | READY  |
+--------+---------+--------+

I0526 08:10:54.349630 1 metrics.cc:651] Collecting metrics for GPU 0: Quadro M4000
I0526 08:10:54.349974 1 tritonserver.cc:1962] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.20.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /models                                                                                                                                                                                      |
| model_control_mode               | MODE_NONE                                                                                                                                                                                    |
| strict_model_config              | 0                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                     |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 5.2                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0526 08:10:54.349987 1 server.cc:252] Waiting for in-flight requests to complete.
I0526 08:10:54.349993 1 model_repository_manager.cc:1029] unloading: yolov4:1
I0526 08:10:54.350048 1 server.cc:267] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I0526 08:10:54.350176 1 tensorrt.cc:5429] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0526 08:10:54.363525 1 tensorrt.cc:5368] TRITONBACKEND_ModelFinalize: delete model state
I0526 08:10:54.391386 1 model_repository_manager.cc:1135] successfully unloaded 'yolov4' version 1
I0526 08:10:55.350155 1 server.cc:267] Timeout 29: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
W0526 08:10:55.352393 1 metrics.cc:469] Unable to get energy consumption for GPU 0. Status:Success, value:0
W0526 08:10:56.352713 1 metrics.cc:469] Unable to get energy consumption for GPU 0. Status:Success, value:0

The plugin gets loaded, but then it is unloaded for some unknown reason.
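The line "Poll failed for model directory 'plugins': unexpected platform type ''" may be the real cause: Triton treats every top-level directory under --model-repository as a model, so a plugins/ folder holding only the .so cannot be resolved to any platform, and with strict readiness the whole server load fails. A minimal Python sketch of that check (simplified; real Triton can also auto-complete a config from a model file such as model.plan, and `find_unpollable_dirs` is a hypothetical helper, not a Triton API):

```python
# Sketch: why a stray 'plugins' directory inside the model repository
# makes Triton's repository poll fail. Every top-level folder must
# resolve to a platform, e.g. via a config.pbtxt.
import os
import re
import tempfile

def find_unpollable_dirs(model_repository: str) -> list[str]:
    """Return top-level directories that lack a resolvable platform."""
    bad = []
    for name in sorted(os.listdir(model_repository)):
        model_dir = os.path.join(model_repository, name)
        if not os.path.isdir(model_dir):
            continue
        platform = ""
        config = os.path.join(model_dir, "config.pbtxt")
        if os.path.exists(config):
            with open(config) as f:
                m = re.search(r'platform:\s*"([^"]*)"', f.read())
                if m:
                    platform = m.group(1)
        if not platform:
            bad.append(name)  # mirrors "unexpected platform type ''"
    return bad

# Usage: a repository holding 'yolov4' (with a config) plus a stray
# 'plugins' folder that contains only the plugin .so file.
repo = tempfile.mkdtemp()
os.makedirs(os.path.join(repo, "yolov4", "1"))
with open(os.path.join(repo, "yolov4", "config.pbtxt"), "w") as f:
    f.write('platform: "tensorrt_plan"\n')
os.makedirs(os.path.join(repo, "plugins"))
open(os.path.join(repo, "plugins", "libyoloplugin.so"), "w").close()

print(find_unpollable_dirs(repo))  # → ['plugins']
```

If that is the cause, keeping libyoloplugin.so outside /models and loading it through the container's LD_PRELOAD environment variable (the approach the repository README describes, if I read it correctly) should avoid the poll failure.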

look4pritam commented 2 years ago

Thank you for the implementation of YOLOv4 using Triton Inference Server.