SuperSecureHuman opened 6 months ago
Just adding on top...
Why is the GPU underutilized? If it is only using 50% of both GPUs, then why doesn't it use 100% of 1 GPU when launched with a single GPU?
Or is there something I am doing wrong?
GPU Topology:

```
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     0-95,97-191     0               N/A
GPU1    SYS      X      0-95,97-191     0               N/A
```
Your GPU interconnect is slow. That's why you don't see a benefit from tensor parallelism.
Could you elaborate further, please?
Addition: I would like to know how to diagnose this, so that I can report it to my systems team and have them try to fix the server itself. Also, please let me know if we should convert this thread into a discussion, since it might not be a bug in vLLM itself.
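For diagnosing, the usual starting point is `nvidia-smi topo -m`, whose link legend goes (fastest to slowest) NV# > PIX > PXB > PHB > NODE > SYS; anything at PHB or worse routes GPU-to-GPU traffic through the CPU. A minimal sketch of a helper that flags slow pairs in that output (the function name and the parsing are my own illustration, not part of vLLM or nvidia-smi):

```python
# Hypothetical helper: parse the matrix printed by `nvidia-smi topo -m`
# and flag GPU pairs whose link type indicates traffic crossing the CPU.
# Link legend (from nvidia-smi): NV# > PIX > PXB > PHB > NODE > SYS (slowest).

SLOW_LINKS = {"SYS", "NODE", "PHB"}  # these all route through the CPU

def find_slow_gpu_pairs(topo_text: str):
    """Return (gpu_a, gpu_b, link) tuples for GPU pairs with slow links."""
    rows = [line.split() for line in topo_text.strip().splitlines()]
    header = rows[0]
    n_gpus = sum(1 for h in header if h.startswith("GPU") and h[3:].isdigit())
    slow = []
    for row in rows[1:]:
        if not row or not (row[0].startswith("GPU") and row[0][3:].isdigit()):
            continue
        a = int(row[0][3:])
        # Only look above the diagonal; column b's link type sits at row[1 + b].
        for b in range(a + 1, n_gpus):
            link = row[1 + b]
            if link in SLOW_LINKS:
                slow.append((a, b, link))
    return slow

# Sample matching the topology reported above (2 GPUs joined by SYS):
sample = """\
      GPU0  GPU1
GPU0  X     SYS
GPU1  SYS   X
"""
print(find_slow_gpu_pairs(sample))  # [(0, 1, 'SYS')]
```

For hard bandwidth numbers to hand to the systems team, the `p2pBandwidthLatencyTest` from the CUDA samples and the NCCL `all_reduce_perf` benchmark are the standard follow-ups.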
Update: Also, any reason why a single GPU doesn't reach a higher GPU usage percentage?
Here is my understanding...
Ideally, in the absence of NVLink, GPU-GPU communication should happen over PCIe peer-to-peer. But in this case, it's happening through the CPU, which is the bottleneck here.
Upon checking the PCIe topology of our motherboard, it looks like the issue lies there. Each PCIe slot is connected directly to the CPU; there is no common PCIe switch/bridge between the GPUs.
In our case, we would have to install NVLink to solve this issue.
Your current environment
🐛 Describe the bug
So, I am trying to serve Mistral 7B, but I am not sure whether these are the performance numbers to be expected. First I would like to know whether I made a configuration mistake, before treating this as a bug.
Launch Method
Here are the performance numbers from 1 GPU
2 GPUs
The throughput when using only 1 GPU is ~1.5 times that of 2 GPUs... I would ideally expect the ratio to be closer to 1. Furthermore, the GPUs are being underutilized.
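To make that expectation explicit, the scaling arithmetic can be sketched as below. The throughput values are made up for illustration, since the actual measurements weren't pasted into the thread:

```python
# Hypothetical throughput numbers for illustration; substitute real measurements.
t_1gpu = 1500.0   # tokens/s with 1 GPU (assumed)
t_2gpu = 1000.0   # tokens/s with 2 GPUs (assumed; ~1.5x slower, as reported)

ratio = t_1gpu / t_2gpu              # >1 means tensor parallelism made things worse
efficiency = t_2gpu / (2 * t_1gpu)   # per-GPU scaling efficiency, ideally near 1.0
print(f"1-GPU/2-GPU ratio = {ratio:.2f}, scaling efficiency = {efficiency:.2f}")
```

With these numbers the ratio is 1.5 and the per-GPU efficiency is ~0.33, i.e. each GPU is doing a third of the useful work it could; the rest is lost to waiting on the slow interconnect.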
More numbers when setting the GPU memory fraction
0.5 memory
0.2 memory
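One thing worth keeping in mind when reading those runs: vLLM's `gpu_memory_utilization` caps the total fraction of GPU memory vLLM may use, but the model weights are a fixed cost, so lowering the fraction mostly shrinks the KV cache (and with it, achievable batch size and throughput). A rough back-of-the-envelope sketch, with all sizes assumed for illustration:

```python
# Rough KV-cache budget estimate; every number here is an assumption.
gpu_mem_gib = 80.0     # e.g. an 80 GiB card
weights_gib = 14.0     # Mistral 7B in fp16 is roughly 14 GiB

def kv_cache_budget(gpu_memory_utilization: float) -> float:
    """GiB left for the KV cache after loading weights (activations ignored)."""
    return gpu_mem_gib * gpu_memory_utilization - weights_gib

for frac in (0.9, 0.5, 0.2):
    print(f"{frac}: {kv_cache_budget(frac):.1f} GiB left for KV cache")
```

On these assumed numbers, 0.5 still leaves a healthy KV cache, but 0.2 leaves almost nothing after the weights, which by itself would tank throughput regardless of the interconnect.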