vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Can't support Phi-3-medium-* models with more than 2 GPUs #5951

Closed fzp0424 closed 2 weeks ago

fzp0424 commented 4 months ago

Your current environment

When I set VLLM_TENSOR_PARALLEL_SIZE = 2, it works well, but when I change it to 4, vLLM cannot load Phi-3-medium-*.

torch=2.3.0
vllm=0.5.0.post1
transformers=4.42.0.dev0

I have also seen the same problem in other issues and in the discussion https://github.com/vllm-project/vllm/discussions/5500.

🐛 Describe the bug

Models and parameters

INFO 06-28 07:44:00 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/data/zhaopengfeng/models/Phi-3-medium-4k-instruct', speculative_config=None, tokenizer='/data/zhaopengfeng/models/Phi-3-medium-4k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/data/zhaopengfeng/models/Phi-3-medium-4k-instruct)
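
For reference, a minimal offline-inference sketch that should reproduce this; the exact launch script is not shown in the issue, so the arguments below are simply inferred from the engine config logged above:

from vllm import LLM

# Arguments mirror the engine config logged above; tensor_parallel_size=2 works,
# while tensor_parallel_size=4 triggers the AssertionError shown below.
llm = LLM(
    model="/data/zhaopengfeng/models/Phi-3-medium-4k-instruct",
    tensor_parallel_size=4,
    trust_remote_code=True,
    dtype="float16",
    enforce_eager=True,
    disable_custom_all_reduce=True,
)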

Error information

(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]   File "/home/zhaopengfeng/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 98, in _initialize_model
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]     return model_class(config=model_config.hf_config,
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]   File "/home/zhaopengfeng/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 340, in __init__
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]     self.model = LlamaModel(config,
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]                  ^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]   File "/home/zhaopengfeng/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 262, in __init__
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]     self.layers = nn.ModuleList([
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]                                 ^
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]   File "/home/zhaopengfeng/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 263, in <listcomp>
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]     LlamaDecoderLayer(config=config,
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]   File "/home/zhaopengfeng/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 188, in __init__
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]     self.self_attn = LlamaAttention(
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]                      ^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]   File "/home/zhaopengfeng/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 109, in __init__
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]     assert self.total_num_kv_heads % tp_size == 0
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148] AssertionError
1149722739 commented 4 months ago

me too

ccruttjr commented 3 months ago

#5500 also talks about this in the Q&A section. I'll paste in what I responded with there. It'd be great if someone could clarify whether this is simply how it will be, or whether Phi specifically needs some updated/added/new code:

Bumping. vllm/model_executor/models/llama.py relies on these assert statements

assert self.total_num_heads % tp_size == 0
# ...
assert self.total_num_kv_heads % tp_size == 0
# or
assert tp_size % self.total_num_kv_heads == 0

I also believe serving requires the vocab size to be divisible by the TP size? And the hidden size? And the number of hidden layers? The issue is that, for Phi-3-medium:

total_num_heads = 40
total_num_kv_heads = 10
vocab_size = 32064
hidden_size = 5120
num_hidden_layers = 40

Again, I'm not sure about the vocab size, but if that constraint holds, only rigs with 2 GPUs (or a single GPU) would work. If the vocab size doesn't matter, we'd need 2, 5, 10, 20, or 40 GPUs (see the sketch below).
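
A minimal sketch of the divisibility check described above, using the head counts listed for Phi-3-medium; the vocab-size constraint is only conjectured in the comment, so it is reported separately rather than assumed:

# Enumerate tensor-parallel sizes that satisfy the asserts quoted above for
# Phi-3-medium (40 attention heads, 10 KV heads). Vocab divisibility is printed
# alongside because it is only conjectured to matter.
total_num_heads = 40
total_num_kv_heads = 10
vocab_size = 32064

for tp_size in range(1, total_num_heads + 1):
    heads_ok = total_num_heads % tp_size == 0
    kv_ok = (total_num_kv_heads % tp_size == 0) or (tp_size % total_num_kv_heads == 0)
    if heads_ok and kv_ok:
        print(tp_size, "vocab divisible:", vocab_size % tp_size == 0)
# Prints 1, 2, 5, 10, 20, 40; of these, only 1 and 2 also divide the vocab size.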

DarkLight1337 commented 2 weeks ago

> (quoting @ccruttjr's comment above)

Sorry for the long delay! Yes, that's just how TP works: the number of attention heads (and of KV heads) must be divisible by the tensor-parallel size, i.e. the number of GPUs used.
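
One way to avoid hitting the assertion only at engine start-up is to read the head counts from the model's HF config first and pick a tensor-parallel size that passes the same checks. A minimal sketch, assuming the config exposes num_attention_heads and num_key_value_heads (as the Phi-3 config does):

from transformers import AutoConfig

# Read the head counts straight from the checkpoint's config.json.
cfg = AutoConfig.from_pretrained(
    "/data/zhaopengfeng/models/Phi-3-medium-4k-instruct", trust_remote_code=True
)
heads = cfg.num_attention_heads                          # 40 for Phi-3-medium
kv_heads = getattr(cfg, "num_key_value_heads", heads)    # 10 for Phi-3-medium

# Tensor-parallel sizes that satisfy the divisibility checks in llama.py.
valid_tp = [
    tp for tp in range(1, heads + 1)
    if heads % tp == 0 and (kv_heads % tp == 0 or tp % kv_heads == 0)
]
print("valid tensor_parallel_size values:", valid_tp)    # [1, 2, 5, 10, 20, 40]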