vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Runtime AssertionError: 32768 is not divisible by 3, multiproc_worker_utils.py:120, when using 3 GPUs for tensor-parallel #6385

Open haltingstate opened 2 months ago

haltingstate commented 2 months ago

Some LLM models hit an assertion error in multiproc_worker_utils.py:120.

This bug is critical and is blocking deployment for a client.

This is a run-time error, not a shutdown error.


  1. Model: AI-ModelScope/Mixtral-8x22B-Instruct-v0.1

  2. Settings: tensor-parallel-size 3 (for 3 GPUs)

  3. Error:

[rank0]: AssertionError: 32768 is not divisible by 3

ERROR 07-13 02:56:30 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 382811 died, exit code: -15
INFO 07-13 02:56:30 multiproc_worker_utils.py:123] Killing local vLLM worker processes


Also

(VllmWorkerProcess pid=412146) WARNING 07-13 03:43:42 custom_all_reduce.py:129] Custom allreduce is disabled due to an unsupported world size: 3. Supported world sizes: [2, 4, 6, 8]. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 07-13 03:43:42 custom_all_reduce.py:129] Custom allreduce is disabled due to an unsupported world size: 3. Supported world sizes: [2, 4, 6, 8]. To silence this warning, specify disable_custom_all_reduce=True explicitly.
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/resource_tracker.py", line 201, in main
    cache[rtype].remove(name)
KeyError: '/psm_dea5b192'
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/resource_tracker.py", line 201, in main
    cache[rtype].remove(name)
KeyError: '/psm_dea5b192'
(VllmWorkerProcess pid=412146) ERROR 07-13 03:43:42 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: 32768 is not divisible by 3, Traceback (most recent call last):
(VllmWorkerProcess pid=412146) ERROR 07-13 03:43:42 multiproc_worker_utils.py:226]   File "/home/cx/.local/lib/python3.8/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
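For context, the failure comes from the divisibility requirement that tensor parallelism imposes when it shards a weight dimension (here the 32768-entry Mixtral vocabulary) evenly across workers. The sketch below only illustrates that requirement; the helper name is made up and is not vLLM's actual API.

```python
# Illustrative sketch only -- not vLLM's actual code. Tensor parallelism
# splits a sharded dimension evenly across workers, so the dimension must
# be divisible by the tensor-parallel size.

def ensure_divisible(numerator: int, denominator: int) -> int:
    """Return numerator // denominator, asserting an even split."""
    assert numerator % denominator == 0, (
        f"{numerator} is not divisible by {denominator}")
    return numerator // denominator


vocab_size = 32768                      # Mixtral-8x22B vocabulary size
print(ensure_divisible(vocab_size, 4))  # 8192 entries per shard -- fine
print(ensure_divisible(vocab_size, 3))  # AssertionError: 32768 is not divisible by 3
```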

youkaichao commented 2 months ago

In which case do you use 3 GPUs? Usually people use 1/2/4/8 GPUs.

haltingstate commented 2 months ago

> In which case do you use 3 GPUs? Usually people use 1/2/4/8 GPUs.

There are 4 PCIe slots in an 8x8x8x8 configuration.

One slot is taken up by an AMD Alveo FPGA accelerator card used for fiber optics.

The maximum number of GPUs that can fit in the server is 3 because of the networking requirement. The base is an AMD Rome 64-core processor, because the matrix operations (sparse, factorization, inverse) that cannot be offloaded to the GPU are done on the CPU.

The FPGA acceleration card is a requirement for the customer deployment: it handles matrix-multiplication-free and sparse-layer offload, which cannot be performed on current-generation GPUs.


This issue also affects frontier Chinese models such as Qwen2, where the vocab size is a prime number and therefore not divisible by the GPU count.

If the vocab size is prime, as in Qwen2, then only 1 GPU can be used; the vocab size is not divisible by 2, 3, or 4 GPUs.

The top-ranked models do not deploy.
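As a rough pre-flight check (a hypothetical helper, not a vLLM feature), one can list which GPU counts evenly divide a sharded dimension such as the vocabulary size:

```python
# Hypothetical pre-flight check, not part of vLLM: list the tensor-parallel
# sizes (GPU counts) that evenly divide a sharded dimension.

def compatible_tp_sizes(dim, max_gpus=8):
    """Return the GPU counts in 1..max_gpus that evenly divide `dim`."""
    return [tp for tp in range(1, max_gpus + 1) if dim % tp == 0]


print(compatible_tp_sizes(32768))  # [1, 2, 4, 8] -- only powers of two work here
print(compatible_tp_sizes(32003))  # a prime dimension leaves only [1]
```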

youkaichao commented 2 months ago

There is an existing issue for this: https://github.com/vllm-project/vllm/issues/5003, but basically it is complicated.

Maybe you can try pipeline-parallel-size = 3?

haltingstate commented 2 months ago

> There is an existing issue for this: #5003, but basically it is complicated.
>
> Maybe you can try pipeline-parallel-size = 3?

The models are too large, and we need to use tensor parallelism to free up GPU memory for the context and for pipelining multiple requests.

How can Qwen2 be run at all if the number of attention heads is a prime number?
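One way to answer this before launching is to read the model's Hugging Face config and see which tensor-parallel sizes divide its attention-head counts. A hedged sketch (requires transformers and Hub access; the model ID and the choice of fields are illustrative assumptions, not vLLM's exact logic):

```python
# Hedged sketch: check which tensor-parallel sizes divide a model's
# attention-head counts. Requires `transformers` and access to the
# Hugging Face Hub; the model ID is just an example.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2-72B-Instruct")
heads = config.num_attention_heads
kv_heads = getattr(config, "num_key_value_heads", heads)

for tp in (1, 2, 3, 4, 6, 8):
    ok = heads % tp == 0 and kv_heads % tp == 0
    print(f"tensor_parallel_size={tp}: {'divides evenly' if ok else 'not divisible'}")
```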

haltingstate commented 2 months ago

Will this pull request be merged in for the next release?

https://github.com/vllm-project/vllm/pull/5367

zhuoyue commented 2 months ago

> Will this pull request be merged in for the next release?
>
> #5367

+1