triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

What is the correct way to run inference in parallel in Triton? #7283

Open sandesha-hegde opened 5 months ago

sandesha-hegde commented 5 months ago

These are the specifications:

- GPU: A100 × 8
- OS: Oracle Linux 8
- CPU(s): 128
- Thread(s) per core: 2
- CUDA Version: 12.2
- Triton Version: 22.07-py3

I have a total of 5 models, and I created 2 instances of each model on every GPU. With all models combined I get about 0.4 req/sec on one GPU. With all 8 A100s attached, the overall throughput is only about 2.2 req/sec even though every GPU is being used. According to the Triton documentation, 8 GPUs should handle at least 8 parallel requests, but that isn't happening. Can anyone tell me whether there is any other configuration I should try to improve performance?
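For reference, a per-model configuration along the lines below would produce the "2 instances on each GPU" layout described above; this is a sketch only, and the model name, backend, and batch size are placeholders rather than the reporter's actual settings:

```
# config.pbtxt -- sketch only; name, backend, and max_batch_size are hypothetical
name: "my_model"
backend: "onnxruntime"
max_batch_size: 8

# Two execution instances on every visible GPU (16 instances across 8 A100s)
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

# Let Triton batch concurrent requests together instead of running them one at a time
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

With the default scheduler, each instance works on one request at a time, so the client has to keep at least as many requests in flight as there are instances before the extra GPUs contribute; dynamic batching additionally lets waiting requests be grouped into a single batch.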

Tabrizian commented 1 month ago

Can you share your model configuration and the backend that you are using?
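While waiting on that, note that server-side throughput only scales if enough requests are in flight at once on the client side. A minimal sketch of issuing parallel requests with the Python tritonclient HTTP API is shown below; the model name "my_model" and the tensor names INPUT0/OUTPUT0 are hypothetical and must match the actual model:

```python
import numpy as np
import tritonclient.http as httpclient

# Connection pool sized for 16 in-flight requests
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=16)

# Build a request payload; tensor name, shape, and dtype are placeholders
inp = httpclient.InferInput("INPUT0", [1, 16], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))
out = httpclient.InferRequestedOutput("OUTPUT0")

# Fire many requests without waiting for each one to finish ...
futures = [
    client.async_infer("my_model", inputs=[inp], outputs=[out])
    for _ in range(64)
]

# ... then collect the results
results = [f.get_result().as_numpy("OUTPUT0") for f in futures]
print(len(results), "responses received")
```

perf_analyzer from the Triton SDK container does the same thing and reports throughput and latency per concurrency level (e.g. `perf_analyzer -m my_model --concurrency-range 1:16`), which is usually the quickest way to tell whether the server or the client is the bottleneck.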