These are the specifications:
GPU: A100 × 8
OS: Oracle Linux 8
CPU(s): 128
Thread(s) per core: 2
CUDA Version: 12.2
Triton Version: 22.07-py3
I have a total of 5 models, with 2 instances of each on every GPU. Combined, the models reach about 0.4 req/sec on a single GPU. With all 8 A100s in use I am only getting 2.2 req/sec. According to the Triton documentation, 8 GPUs should handle a minimum of 8 parallel requests, but that is not happening. Can anyone tell me whether there is any other configuration I should try to improve performance?