Open haltingstate opened 2 months ago
in which case do you use 3 GPUs? usually people use 1/2/4/8 GPUs.
There are 4x PCIe slots in an x8/x8/x8/x8 configuration.
One slot is taken up by an AMD Alveo FPGA accelerator card and its fiber optics.
The maximum number of GPUs that can fit in the server is 3x because of the networking requirement. The base is an AMD Rome 64-core processor, because the matrix operations (sparse, factorization, inverse) are done on the CPU and cannot be offloaded to the GPU.
The FPGA accelerator card is a requirement for customer deployment: it handles matrix-multiplication-free and sparse layer offload, which cannot be performed on current-generation GPUs.
This issue also affects frontier Chinese models like Qwen2, where the vocab size is a prime number and therefore not divisible by the GPU count.
If the vocab size is prime, as in Qwen2, then only 1 GPU can be used: the vocab size is not divisible by 2, 3, or 4 GPUs.
The top-ranked models do not deploy.
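For context, the underlying constraint is just even sharding of the embedding/LM-head rows across the tensor-parallel ranks. A minimal sketch of the check and the usual workaround (padding the vocab with unused tokens); the function names here are illustrative, not vLLM's actual API:

```python
def shard_size(vocab_size: int, tp_size: int) -> int:
    # Each GPU holds an equal slice of the embedding/LM-head rows,
    # so the vocab size must divide evenly by the world size.
    assert vocab_size % tp_size == 0, (
        f"{vocab_size} is not divisible by {tp_size}")
    return vocab_size // tp_size

def pad_vocab(vocab_size: int, tp_size: int) -> int:
    # Common workaround: pad the vocab up to the next multiple of
    # the GPU count with unused tokens, then shard the padded size.
    return -(-vocab_size // tp_size) * tp_size  # ceil-divide, re-multiply

print(pad_vocab(32768, 3))  # 32769, which splits as 3 x 10923
```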
There is an issue for this: https://github.com/vllm-project/vllm/issues/5003, but basically it is complicated.
Maybe you can try pipeline parallel size = 3?
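If you want to try that, the launch would look roughly like this (hedged: flag names are per the OpenAI-compatible server entrypoint; whether pipeline parallelism works for a given model depends on your vLLM version, and the model name below is only an example):

```shell
# Illustrative launch with 3-way pipeline parallelism instead of
# tensor parallelism, sidestepping the vocab-divisibility constraint.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2-72B-Instruct \
    --pipeline-parallel-size 3 \
    --tensor-parallel-size 1
```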
The models are too large, and we need to use tensor parallelism to free up GPU memory for context and for pipelining multiple requests.
How can Qwen2 be run at all if the number of attention heads is a prime number?
Will this pull request be merged in, for the next release?
#5367
+1
Some LLM models hit an assertion error in multiproc_worker_utils.py:120.
This bug is critical and is preventing deployment for the client.
This is a run-time error, not a shutdown error:
```
[rank0]: AssertionError: 32768 is not divisible by 3
ERROR 07-13 02:56:30 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 382811 died, exit code: -15
INFO 07-13 02:56:30 multiproc_worker_utils.py:123] Killing local vLLM worker processes
```
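A side note on reading that log: the exit code -15 is a signal number, not a separate failure. The parent terminated the worker with SIGTERM after the assertion fired, so the assertion is the root cause and the -15 is just cleanup:

```python
import signal

# multiprocessing reports a negative exitcode when a child process is
# killed by a signal; -15 corresponds to SIGTERM.
print(-signal.SIGTERM)  # -15
```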
Also
```
(VllmWorkerProcess pid=412146) WARNING 07-13 03:43:42 custom_all_reduce.py:129] Custom allreduce is disabled due to an unsupported world size: 3. Supported world sizes: [2, 4, 6, 8]. To silence this warning, specify disable_custom_all_reduce=True explicitly.
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/resource_tracker.py", line 201, in main
    cache[rtype].remove(name)
KeyError: '/psm_dea5b192'
(VllmWorkerProcess pid=412146) ERROR 07-13 03:43:42 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: 32768 is not divisible by 3, Traceback (most recent call last):
(VllmWorkerProcess pid=412146) ERROR 07-13 03:43:42 multiproc_worker_utils.py:226]   File "/home/cx/.local/lib/python3.8/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
```