Open Will-Chou-5722 opened 2 months ago
Hello Will, I reported a similar issue: #7706. I'm wondering if you have found any clues toward a solution? Thanks!
Hi @Will-Chou-5722,
I think your observations are correct. The TensorRT backend is unique in that it uses a single thread for multiple model instances on the same GPU, whereas most other backends use one thread per model instance. You can read more details about this TensorRT backend behavior here (two different comments):
Hi @rmccorm4, thank you for the information. It's very helpful.
Description
I noticed the "Concurrent Model Execution" section in the documentation. Triton can enable parallel execution of a model by adjusting instance_group.
After setting instance_group to 4, I did not observe any parallel execution; I only noticed that the number of CUDA streams increased. Are there any other parameters that need to be adjusted? Could you give me some suggestions? The picture below shows the result of sending two requests (to the same model) at the same time, observed with Nsight Systems.
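For reference, a minimal `config.pbtxt` sketch of the setup described above; the model name is taken from the perf_analyzer commands below, and everything except the instance_group block is assumed:

```
# Minimal sketch of the configuration described in this issue.
# Only instance_group is the point here; name/platform are assumptions.
name: "tensorrt_fp16_model"
platform: "tensorrt_plan"
instance_group [
  {
    count: 4        # four instances of the model
    kind: KIND_GPU  # placed on the GPU
  }
]
```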
The commands are as follows:

Triton server:
```
./tritonserver --model-repository=../docs/examples/model_repository/
```

Client:
```
./perf_analyzer -m tensorrt_fp16_model --service-kind=triton -i http & \
./perf_analyzer -m tensorrt_fp16_model --service-kind=triton -i http
```
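For completeness, here is a minimal Python sketch (not from the original report) that drives two concurrent requests with tritonclient instead of two perf_analyzer processes. The server URL, input tensor name, shape, and dtype are assumptions and would need to match the actual model:

```python
# Sketch: send two requests at the same time so that, with
# instance_group count > 1, they can be scheduled onto different
# model instances. Assumes the server at localhost:8000 and an
# input tensor named "input" of shape [1, 3, 224, 224] (FP16).
import concurrent.futures

import numpy as np
import tritonclient.http as httpclient

MODEL = "tensorrt_fp16_model"

def send_request():
    # One client per thread, since the HTTP client object is not
    # guaranteed to be thread-safe when shared.
    client = httpclient.InferenceServerClient(url="localhost:8000")
    data = np.zeros((1, 3, 224, 224), dtype=np.float16)
    inp = httpclient.InferInput("input", list(data.shape), "FP16")
    inp.set_data_from_numpy(data)
    return client.infer(MODEL, inputs=[inp])

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(send_request) for _ in range(2)]
    results = [f.result() for f in futures]
```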
Information:
- Nsight Systems version: 2024.2.2.28-242234212449v0 (Linux)
- Hardware: NVIDIA Jetson AGX Orin, JetPack 6.0
- Triton server version: 2.48.0