triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Question about server/docs/examples/stable_diffusion/ #5349

Open · BobDLA opened this issue 1 year ago

BobDLA commented 1 year ago

In the demo, the VAE model is accelerated with TensorRT:

Accelerating VAE with TensorRT

trtexec --onnx=vae.onnx --saveEngine=vae.plan --minShapes=latent_sample:1x4x64x64 --optShapes=latent_sample:4x4x64x64 --maxShapes=latent_sample:8x4x64x64 --fp16

But encoder.onnx is kept as is, with no acceleration.

My questions are:

  • Could the encoder.onnx be converted as well?
  • Will the encoder.onnx also be accelerated in ONNX form?
  • Is there an API to have the model converted automatically when the Triton server is initializing?

Thanks

rmccorm4 commented 1 year ago

> Could the encoder.onnx be converted as well? Will the encoder.onnx also be accelerated in ONNX form?

CC @tanayvarshney for questions about the stable diffusion example

> Is there an API to have the model converted automatically when the Triton server is initializing?

Yes, you can make use of the Optimization section of the model config to have ONNX models automatically converted to TensorRT engines on server startup (assuming the model is convertible).
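
For example, adding something along these lines to the model's config.pbtxt turns on the TensorRT accelerator for the ONNX Runtime backend (a minimal sketch; the precision and workspace values are illustrative, not taken from the example):

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "1073741824" }
      }
    ]
  }
}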

tanayvarshney commented 1 year ago

@BobDLA This example uses the Text to Image pipeline, which doesn't need the VAE encoder. The encoder is needed for the Image to Image and the inpainting/outpainting pipelines.

BobDLA commented 1 year ago

> @BobDLA This example uses the Text to Image pipeline, which doesn't need the VAE encoder. The encoder is needed for the Image to Image and the inpainting/outpainting pipelines.
>
>   • Could the encoder.onnx be converted as well? Yes, it can be exported to ONNX and accelerated with TensorRT.
>   • Will the encoder.onnx also be accelerated in ONNX form? Not entirely certain what you mean here, but Triton supports a wide variety of models, be it .onnx or .pt or many others. Triton doesn't accelerate the models by default. You can choose to use accelerators for the ONNX backend like ORT-TRT, but you will probably get better performance out of using TensorRT natively for acceleration.
>   • Is there an API to have the model converted automatically when the Triton server is initializing? Triton doesn't automatically convert any of the models; you need to specify the acceleration type in the config file. That said, native conversion is likely to yield better performance. There are scripts and explanations in the example. I highly encourage you to watch the explainer video in the example for more context.
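
As an illustration of the first point, converting an exported encoder would look much like the VAE command from the example; the ONNX file name, input tensor name, and shapes below are placeholders rather than values taken from the example:

trtexec --onnx=vae_encoder.onnx --saveEngine=vae_encoder.plan --minShapes=sample:1x3x512x512 --optShapes=sample:4x3x512x512 --maxShapes=sample:8x3x512x512 --fp16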

@tanayvarshney Thanks for your answer. I will try it. I want to run the demo on a server with 4 x 8GB GPUs, but it fails with out of memory. Can you help me get it working? Thanks. Here is the log:

docker image version: nvcr.io/nvidia/tritonserver:23.01-py3

root@760c1ad34b85:/workspace/docs/examples/stable_diffusion# tritonserver --model-repository=model_repository/
I0214 03:52:07.296971 5265 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fa7a8000000' with size 268435456
I0214 03:52:07.298917 5265 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0214 03:52:07.298927 5265 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
I0214 03:52:07.298944 5265 cuda_memory_manager.cc:105] CUDA memory pool is created on device 2 with size 67108864
I0214 03:52:07.298950 5265 cuda_memory_manager.cc:105] CUDA memory pool is created on device 3 with size 67108864
W0214 03:52:07.563825 5265 server.cc:218] failed to enable peer access for some device pairs
I0214 03:52:07.567952 5265 model_lifecycle.cc:459] loading: pipeline:1
I0214 03:52:07.568008 5265 model_lifecycle.cc:459] loading: text_encoder:1
I0214 03:52:07.568038 5265 model_lifecycle.cc:459] loading: vae:1
I0214 03:52:07.574775 5265 onnxruntime.cc:2459] TRITONBACKEND_Initialize: onnxruntime
I0214 03:52:07.574828 5265 onnxruntime.cc:2469] Triton TRITONBACKEND API version: 1.11
I0214 03:52:07.574843 5265 onnxruntime.cc:2475] 'onnxruntime' TRITONBACKEND API version: 1.11
I0214 03:52:07.574858 5265 onnxruntime.cc:2505] backend configuration: {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I0214 03:52:10.964694 5265 onnxruntime.cc:2563] TRITONBACKEND_ModelInitialize: text_encoder (version 1)
I0214 03:52:10.965708 5265 onnxruntime.cc:666] skipping model configuration auto-complete for 'text_encoder': inputs and outputs already specified
I0214 03:52:10.966981 5265 onnxruntime.cc:2563] TRITONBACKEND_ModelInitialize: vae (version 1)
I0214 03:52:10.967812 5265 onnxruntime.cc:666] skipping model configuration auto-complete for 'vae': inputs and outputs already specified
I0214 03:52:10.968928 5265 python_be.cc:1858] TRITONBACKEND_ModelInstanceInitialize: pipeline_0 (GPU device 0)
I0214 03:52:19.226810 5265 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: text_encoder_0 (GPU device 0)
2023-02-14 03:52:19.916557805 [W:onnxruntime:, inference_session.cc:510 RegisterExecutionProvider] Parallel execution mode does not support the CUDA Execution Provider. So making the execution mode sequential for this session since it uses the CUDA Execution Provider.
2023-02-14 03:52:20.130929081 [W:onnxruntime:, session_state.cc:1030 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-02-14 03:52:20.130943819 [W:onnxruntime:, session_state.cc:1032 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0214 03:52:21.286379 5265 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: vae_0 (GPU device 0)
2023-02-14 03:52:21.515297079 [W:onnxruntime:, session_state.cc:1030 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-02-14 03:52:21.515320616 [W:onnxruntime:, session_state.cc:1032 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0214 03:52:21.599133 5265 python_be.cc:1858] TRITONBACKEND_ModelInstanceInitialize: pipeline_0 (GPU device 1)
I0214 03:52:30.752293 5265 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: text_encoder_0 (GPU device 1)
2023-02-14 03:52:31.374174687 [W:onnxruntime:, inference_session.cc:510 RegisterExecutionProvider] Parallel execution mode does not support the CUDA Execution Provider. So making the execution mode sequential for this session since it uses the CUDA Execution Provider.
2023-02-14 03:52:31.583102253 [W:onnxruntime:, session_state.cc:1030 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-02-14 03:52:31.583117916 [W:onnxruntime:, session_state.cc:1032 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0214 03:52:32.357585 5265 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: vae_0 (GPU device 1)
2023-02-14 03:52:32.546151567 [W:onnxruntime:, session_state.cc:1030 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-02-14 03:52:32.546174375 [W:onnxruntime:, session_state.cc:1032 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0214 03:52:32.631071 5265 python_be.cc:1858] TRITONBACKEND_ModelInstanceInitialize: pipeline_0 (GPU device 2)
I0214 03:52:40.961642 5265 pb_stub.cc:314] Failed to initialize Python stub: OutOfMemoryError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 7.79 GiB total capacity; 996.43 MiB already allocated; 29.38 MiB free; 1.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

At:
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(987): convert
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(664): _apply
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(989): to
/workspace/docs/examples/stable_diffusion/model_repository/pipeline/1/model.py(53): initialize

I0214 03:52:41.722845 5265 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: text_encoder_0 (GPU device 2)
2023-02-14 03:52:42.275243534 [W:onnxruntime:, inference_session.cc:510 RegisterExecutionProvider] Parallel execution mode does not support the CUDA Execution Provider. So making the execution mode sequential for this session since it uses the CUDA Execution Provider.
2023-02-14 03:52:42.489275487 [W:onnxruntime:, session_state.cc:1030 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-02-14 03:52:42.489290152 [W:onnxruntime:, session_state.cc:1032 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0214 03:52:43.333758 5265 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: vae_0 (GPU device 2)
2023-02-14 03:52:43.521740289 [W:onnxruntime:, session_state.cc:1030 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-02-14 03:52:43.521761972 [W:onnxruntime:, session_state.cc:1032 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0214 03:52:43.595104 5265 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: text_encoder_0 (GPU device 3)
2023-02-14 03:52:44.068587381 [W:onnxruntime:, inference_session.cc:510 RegisterExecutionProvider] Parallel execution mode does not support the CUDA Execution Provider. So making the execution mode sequential for this session since it uses the CUDA Execution Provider.
2023-02-14 03:52:44.278063801 [W:onnxruntime:, session_state.cc:1030 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-02-14 03:52:44.278078697 [W:onnxruntime:, session_state.cc:1032 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0214 03:52:45.039659 5265 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: vae_0 (GPU device 3)
I0214 03:52:45.040076 5265 model_lifecycle.cc:694] successfully loaded 'text_encoder' version 1
E0214 03:52:45.156380 5265 model_lifecycle.cc:597] failed to load 'pipeline' version 1: Internal: OutOfMemoryError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 7.79 GiB total capacity; 996.43 MiB already allocated; 29.38 MiB free; 1.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

At:
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(987): convert
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(664): _apply
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(989): to
/workspace/docs/examples/stable_diffusion/model_repository/pipeline/1/model.py(53): initialize

2023-02-14 03:52:45.208641855 [W:onnxruntime:, session_state.cc:1030 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2023-02-14 03:52:45.208660089 [W:onnxruntime:, session_state.cc:1032 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0214 03:52:45.282564 5265 model_lifecycle.cc:694] successfully loaded 'vae' version 1
I0214 03:52:45.282771 5265 server.cc:563]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0214 03:52:45.282900 5265 server.cc:590]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                          |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so          | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000"," |
|             |                                                                 | backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}  |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000"," |
|             |                                                                 | backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}  |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------+

I0214 03:52:45.283057 5265 server.cc:633]
+--------------+---------+-------------------------------------------------------------------------------------------------+
| Model        | Version | Status                                                                                          |
+--------------+---------+-------------------------------------------------------------------------------------------------+
| pipeline     | 1       | UNAVAILABLE: Internal: OutOfMemoryError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU   |
|              |         | 0; 7.79 GiB total capacity; 996.43 MiB already allocated; 29.38 MiB free; 1.02 GiB reserved in  |
|              |         | total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to    |
|              |         | avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF        |
|              |         | At:                                                                                             |
|              |         | /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(987): convert                 |
|              |         | /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(664): _apply                  |
|              |         | /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply                  |
|              |         | /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply                  |
|              |         | /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply                  |
|              |         | /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply                  |
|              |         | /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(641): _apply                  |
|              |         | /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(989): to                      |
|              |         | /workspace/docs/examples/stable_diffusion/model_repository/pipeline/1/model.py(53): initialize  |
| text_encoder | 1       | READY                                                                                           |
| vae          | 1       | READY                                                                                           |
+--------------+---------+-------------------------------------------------------------------------------------------------+

I0214 03:52:45.327012 5265 metrics.cc:864] Collecting metrics for GPU 0: NVIDIA GeForce RTX 2080
I0214 03:52:45.327047 5265 metrics.cc:864] Collecting metrics for GPU 1: NVIDIA GeForce RTX 2080
I0214 03:52:45.327059 5265 metrics.cc:864] Collecting metrics for GPU 2: NVIDIA GeForce RTX 2080
I0214 03:52:45.327072 5265 metrics.cc:864] Collecting metrics for GPU 3: NVIDIA GeForce RTX 2080
I0214 03:52:45.328185 5265 metrics.cc:757] Collecting CPU metrics
I0214 03:52:45.328400 5265 tritonserver.cc:2264]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                        |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                       |
| server_version                   | 2.30.0                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration            |
|                                  | system_shared_memory cuda_shared_memory binary_tensor_data statistics trace logging                                         |
| model_repository_path[0]         | model_repository/                                                                                                            |
| model_control_mode               | MODE_NONE                                                                                                                    |
| strict_model_config              | 0                                                                                                                            |
| rate_limit                       | OFF                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                     |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                     |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                                                     |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                                                     |
| response_cache_byte_size         | 0                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                          |
| strict_readiness                 | 1                                                                                                                            |
| exit_timeout                     | 30                                                                                                                           |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------+

I0214 03:52:45.328450 5265 server.cc:264] Waiting for in-flight requests to complete.
I0214 03:52:45.328461 5265 server.cc:280] Timeout 30: Found 0 model versions that have in-flight inferences
I0214 03:52:45.328581 5265 server.cc:295] All models are stopped, unloading models
I0214 03:52:45.328595 5265 server.cc:302] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
I0214 03:52:45.328719 5265 onnxruntime.cc:2640] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0214 03:52:45.328775 5265 onnxruntime.cc:2640] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0214 03:52:45.345114 5265 onnxruntime.cc:2640] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0214 03:52:45.367579 5265 onnxruntime.cc:2640] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0214 03:52:45.372217 5265 onnxruntime.cc:2640] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0214 03:52:45.384242 5265 onnxruntime.cc:2640] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0214 03:52:45.406878 5265 onnxruntime.cc:2640] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0214 03:52:45.408138 5265 onnxruntime.cc:2586] TRITONBACKEND_ModelFinalize: delete model state
I0214 03:52:45.408171 5265 model_lifecycle.cc:579] successfully unloaded 'vae' version 1
I0214 03:52:45.434247 5265 onnxruntime.cc:2640] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0214 03:52:45.463915 5265 onnxruntime.cc:2586] TRITONBACKEND_ModelFinalize: delete model state
I0214 03:52:45.463988 5265 model_lifecycle.cc:579] successfully unloaded 'text_encoder' version 1
I0214 03:52:46.328713 5265 server.cc:302] Timeout 29: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

BobDLA commented 1 year ago

Also, I tried setting instance_group with count: 0, but it doesn't work. Does instance_group support count: 0, in case I want only a single instance of the model in total instead of one instance per GPU?
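
To make the question concrete: if count: 0 isn't supported, what I would expect to need is pinning a single instance to one GPU in config.pbtxt, along these lines (a sketch; the GPU index is illustrative):

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]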

ahmednofal commented 1 year ago

I am having a similar situation. I am using the C API server, though; it successfully loads the model but then unloads it. The part of the logs showing the successful loading:

I0501 10:50:06.552750 11894 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0501 10:50:06.556296 11894 model_lifecycle.cc:459] loading: VehicleREID_onnx:1
I0501 10:50:06.557432 11894 onnxruntime.cc:2459] TRITONBACKEND_Initialize: onnxruntime
I0501 10:50:06.557450 11894 onnxruntime.cc:2469] Triton TRITONBACKEND API version: 1.10
I0501 10:50:06.557456 11894 onnxruntime.cc:2475] 'onnxruntime' TRITONBACKEND API version: 1.10
I0501 10:50:06.557460 11894 onnxruntime.cc:2505] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I0501 10:50:06.566806 11894 onnxruntime.cc:2563] TRITONBACKEND_ModelInitialize: VehicleREID_onnx (version 1)
I0501 10:50:06.567316 11894 onnxruntime.cc:666] skipping model configuration auto-complete for 'VehicleREID_onnx': inputs and outputs already specified
I0501 10:50:06.567674 11894 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: VehicleREID_onnx (GPU device 0)
I0501 10:50:07.212686 11894 model_lifecycle.cc:694] successfully loaded 'VehicleREID_onnx' version 1
I0501 10:50:07.212794 11894 server.cc:563] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0501 10:50:07.212831 11894 server.cc:590] 
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                          |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------+
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000"," |
|             |                                                                 | backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}  |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------+

I0501 10:50:07.212861 11894 server.cc:633] 
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| VehicleREID_onnx | 1       | READY  |
+------------------+---------+--------+

I0501 10:50:07.249427 11894 metrics.cc:864] Collecting metrics for GPU 0: Tesla T4
I0501 10:50:07.249671 11894 metrics.cc:757] Collecting CPU metrics

the part of the logs showing successful unloading:

I0501 10:50:07.250015 11894 server.cc:264] Waiting for in-flight requests to complete.
I0501 10:50:07.250022 11894 server.cc:280] Timeout 30: Found 0 model versions that have in-flight inferences
I0501 10:50:07.250073 11894 server.cc:295] All models are stopped, unloading models
I0501 10:50:07.250080 11894 server.cc:302] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I0501 10:50:07.250268 11894 onnxruntime.cc:2640] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0501 10:50:07.255547 11894 onnxruntime.cc:2586] TRITONBACKEND_ModelFinalize: delete model state
I0501 10:50:07.255606 11894 model_lifecycle.cc:579] successfully unloaded 'VehicleREID_onnx' version 1
I0501 10:50:08.250155 11894 server.cc:302] Timeout 29: Found 0 live models and 0 in-flight non-inference requests

full log:

I0501 10:50:06.550601 11894 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f882a000000' with size 268435456
I0501 10:50:06.552750 11894 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0501 10:50:06.556296 11894 model_lifecycle.cc:459] loading: VehicleREID_onnx:1
I0501 10:50:06.557432 11894 onnxruntime.cc:2459] TRITONBACKEND_Initialize: onnxruntime
I0501 10:50:06.557450 11894 onnxruntime.cc:2469] Triton TRITONBACKEND API version: 1.10
I0501 10:50:06.557456 11894 onnxruntime.cc:2475] 'onnxruntime' TRITONBACKEND API version: 1.10
I0501 10:50:06.557460 11894 onnxruntime.cc:2505] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I0501 10:50:06.566806 11894 onnxruntime.cc:2563] TRITONBACKEND_ModelInitialize: VehicleREID_onnx (version 1)
I0501 10:50:06.567316 11894 onnxruntime.cc:666] skipping model configuration auto-complete for 'VehicleREID_onnx': inputs and outputs already specified
I0501 10:50:06.567674 11894 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: VehicleREID_onnx (GPU device 0)
I0501 10:50:07.212686 11894 model_lifecycle.cc:694] successfully loaded 'VehicleREID_onnx' version 1
I0501 10:50:07.212794 11894 server.cc:563] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0501 10:50:07.212831 11894 server.cc:590] 
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                          |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------+
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000"," |
|             |                                                                 | backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}  |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------+

I0501 10:50:07.212861 11894 server.cc:633] 
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| VehicleREID_onnx | 1       | READY  |
+------------------+---------+--------+

I0501 10:50:07.249427 11894 metrics.cc:864] Collecting metrics for GPU 0: Tesla T4
I0501 10:50:07.249671 11894 metrics.cc:757] Collecting CPU metrics
I0501 10:50:07.249854 11894 tritonserver.cc:2264] 
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                          |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton_C_API                                                                                                                   |
| server_version                   | 2.28.0                                                                                                                         |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared |
|                                  | _memory cuda_shared_memory binary_tensor_data statistics trace logging                                                         |
| model_repository_path[0]         | /path/to/model_repository                                                                                     |
| model_control_mode               | MODE_NONE                                                                                                                      |
| strict_model_config              | 0                                                                                                                              |
| rate_limit                       | OFF                                                                                                                            |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                      |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                       |
| response_cache_byte_size         | 0                                                                                                                              |
| min_supported_compute_capability | 6.0                                                                                                                            |
| strict_readiness                 | 1                                                                                                                              |
| exit_timeout                     | 30                                                                                                                             |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+

Server Health: live 1, ready 1
Server Metadata:
{"name":"triton_C_API","version":"2.28.0","extensions":["classification","sequence","model_repository","model_repository(unload_dependents)","schedule_policy","model_configuration","system_shared_memory","cuda_shared_memory","binary_tensor_data","statistics","trace","logging"]}
I0501 10:50:07.250015 11894 server.cc:264] Waiting for in-flight requests to complete.
I0501 10:50:07.250022 11894 server.cc:280] Timeout 30: Found 0 model versions that have in-flight inferences
I0501 10:50:07.250073 11894 server.cc:295] All models are stopped, unloading models
I0501 10:50:07.250080 11894 server.cc:302] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I0501 10:50:07.250268 11894 onnxruntime.cc:2640] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0501 10:50:07.255547 11894 onnxruntime.cc:2586] TRITONBACKEND_ModelFinalize: delete model state
I0501 10:50:07.255606 11894 model_lifecycle.cc:579] successfully unloaded 'VehicleREID_onnx' version 1
I0501 10:50:08.250155 11894 server.cc:302] Timeout 29: Found 0 live models and 0 in-flight non-inference requests

Please help :(