triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

What is the correct way to run inference in parallel in Triton? #7283

Open sandesha-hegde opened 5 months ago

sandesha-hegde commented 5 months ago

These are the specifications:

- GPU: A100 × 8
- OS: Oracle Linux 8
- CPU(s): 128
- Thread(s) per core: 2
- CUDA Version: 12.2
- Triton Version: 22.07-py3

I have a total of 5 models, and I created 2 instances of each model on every GPU. With all models combined I get about 0.4 req/sec on one GPU. With all 8 A100s attached, the overall throughput is only about 2.2 req/sec even though every GPU is being used. According to the Triton documentation, 8 GPUs should handle at least 8 parallel requests, but that isn't happening. Can anyone tell me whether there is any other configuration I should try to improve performance?
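For reference, a per-model configuration along the lines below would produce the "2 instances on each GPU" layout described above; this is a sketch only, and the model name, backend, and batch size are placeholders rather than the reporter's actual settings:

```
# config.pbtxt -- sketch only; name, backend, and max_batch_size are hypothetical
name: "my_model"
backend: "onnxruntime"
max_batch_size: 8

# Two execution instances on every visible GPU (16 instances across 8 A100s)
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

# Let Triton batch concurrent requests together instead of running them one at a time
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

With the default scheduler, each instance works on one request at a time, so the client has to keep at least as many requests in flight as there are instances before the extra GPUs contribute; dynamic batching additionally lets waiting requests be grouped into a single batch.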

Tabrizian commented 1 month ago

Can you share your model configuration and the backend that you are using?
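While waiting on that, note that server-side throughput only scales if enough requests are in flight at once on the client side. A minimal sketch of issuing parallel requests with the Python tritonclient HTTP API is shown below; the model name "my_model" and the tensor names INPUT0/OUTPUT0 are hypothetical and must match the actual model:

```python
import numpy as np
import tritonclient.http as httpclient

# Connection pool sized for 16 in-flight requests
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=16)

# Build a request payload; tensor name, shape, and dtype are placeholders
inp = httpclient.InferInput("INPUT0", [1, 16], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))
out = httpclient.InferRequestedOutput("OUTPUT0")

# Fire many requests without waiting for each one to finish ...
futures = [
    client.async_infer("my_model", inputs=[inp], outputs=[out])
    for _ in range(64)
]

# ... then collect the results
results = [f.get_result().as_numpy("OUTPUT0") for f in futures]
print(len(results), "responses received")
```

perf_analyzer from the Triton SDK container does the same thing and reports throughput and latency per concurrency level (e.g. `perf_analyzer -m my_model --concurrency-range 1:16`), which is usually the quickest way to tell whether the server or the client is the bottleneck.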