triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

How to set the parameters to enable concurrent model execution? #7579

Open Will-Chou-5722 opened 2 months ago

Will-Chou-5722 commented 2 months ago

Description: I noticed the "Concurrent Model Execution" section of the user guide. Triton can execute a model in parallel when instance_group is adjusted.
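For reference, this is the kind of setting being discussed; a minimal config.pbtxt sketch that asks Triton for four execution instances of a model on one GPU (the count and GPU index here are illustrative, not the exact file from this setup):

```
# config.pbtxt (sketch): request four execution instances of the model
# on GPU 0. Values shown are illustrative.
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```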

After setting instance_group to 4, I did not see parallel execution; I only noticed that the number of CUDA streams increased. Are there any other parameters that need to be adjusted? Could you give me some suggestions? The picture below shows the result of sending two requests (to the same model) at the same time and observing with Nsight. [Screenshot: Nsight Systems timeline, 2024-08-30 143240]

The commands were as follows:

Triton server:

```
./tritonserver --model-repository=../docs/examples/model_repository/
```

Client:

```
./perf_analyzer -m tensorrt_fp16_model --service-kind=triton -i HTTP & \
./perf_analyzer -m tensorrt_fp16_model --service-kind=triton -i http
```
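As a side note, perf_analyzer can also generate overlapping requests from a single process with its concurrency option, which avoids depending on two shell jobs starting at exactly the same moment. A sketch using the model name from this issue:

```
# Keep 4 requests in flight against the same model from one client process.
./perf_analyzer -m tensorrt_fp16_model --service-kind=triton -i http --concurrency-range 4
```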

Information:

- Nsight Systems version: 2024.2.2.28-242234212449v0 (Linux)
- Hardware: NVIDIA Jetson AGX Orin, JetPack 6.0
- Triton server version: 2.48.0

lei1liu commented 1 month ago

Hello Will, I reported a similar issue: #7706. I'm wondering if you have found any clues toward a solution? Thanks!

rmccorm4 commented 1 week ago

Hi @Will-Chou-5722,

I think your observations look correct. The TensorRT backend specifically is unique in that it uses one thread for multiple model instances on the same GPU, whereas most other backends will have one thread per model instance. You can read more details about this TensorRT backend behavior in the two comments linked from this issue.
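One practical way to see this in a profile (a sketch using standard Nsight Systems options and the server command from earlier in this thread): capture the server while the perf_analyzer clients run, then check whether kernel launches for both requests come from the same CPU thread but land on different CUDA streams, which is what the shared-issuing-thread behavior described above would look like.

```
# Profile the server with CUDA, NVTX, and OS runtime tracing (standard nsys
# options); the model repository path is the one used earlier in this issue.
nsys profile --trace=cuda,nvtx,osrt -o triton_profile \
  ./tritonserver --model-repository=../docs/examples/model_repository/
```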

Will-Chou-5722 commented 4 days ago

Hi @rmccorm4 Thank you for the information. It's very helpful.