vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Can't support Phi-3-medium-* models with more than 2 GPUs #5951

Open fzp0424 opened 2 months ago

fzp0424 commented 2 months ago

Your current environment

When I set VLLM_TENSOR_PARALLEL_SIZE = 2, it works well. But when I change it to 4, vLLM cannot load Phi-3-medium-* models.

torch=2.3.0
vllm=0.5.0.post1
transformers=4.42.0.dev0

I've also seen the same problem reported in other issues and in this discussion: https://github.com/vllm-project/vllm/discussions/5500

🐛 Describe the bug

Models and parameters

INFO 06-28 07:44:00 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/data/zhaopengfeng/models/Phi-3-medium-4k-instruct', speculative_config=None, tokenizer='/data/zhaopengfeng/models/Phi-3-medium-4k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/data/zhaopengfeng/models/Phi-3-medium-4k-instruct)

Error information

(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]   File "/home/zhaopengfeng/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 98, in _initialize_model
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]     return model_class(config=model_config.hf_config,
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]   File "/home/zhaopengfeng/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 340, in __init__
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]     self.model = LlamaModel(config,
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]                  ^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]   File "/home/zhaopengfeng/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 262, in __init__
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]     self.layers = nn.ModuleList([
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]                                 ^
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]   File "/home/zhaopengfeng/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 263, in <listcomp>
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]     LlamaDecoderLayer(config=config,
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]   File "/home/zhaopengfeng/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 188, in __init__
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]     self.self_attn = LlamaAttention(
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]                      ^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]   File "/home/zhaopengfeng/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 109, in __init__
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]     assert self.total_num_kv_heads % tp_size == 0
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=53479) ERROR 06-28 07:44:11 worker_base.py:148] AssertionError
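To make the failure concrete: here is a minimal sketch (not vLLM's actual code) of the divisibility check that the traceback above fails, using Phi-3-medium's attention-head counts. The check_tp helper is hypothetical; it just mirrors the asserts in LlamaAttention.__init__.

```python
# Phi-3-medium attention-head counts (from the model's config.json)
total_num_heads = 40     # num_attention_heads
total_num_kv_heads = 10  # num_key_value_heads

def check_tp(tp_size: int) -> bool:
    """Mirror the asserts in vllm's LlamaAttention.__init__ (sketch only)."""
    # Query heads must shard evenly across tensor-parallel ranks.
    if total_num_heads % tp_size != 0:
        return False
    # KV heads must either shard evenly, or be replicated evenly.
    return total_num_kv_heads % tp_size == 0 or tp_size % total_num_kv_heads == 0

print(check_tp(2))  # True  -> tensor_parallel_size=2 works
print(check_tp(4))  # False -> 10 % 4 != 0, the AssertionError above
```

So with tensor_parallel_size=4, the 10 KV heads cannot be split evenly, and initialization aborts.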
1149722739 commented 2 months ago

me too

ccruttjr commented 1 month ago

#5500 also talks about this in the Q&A section; I'll paste in what I responded with there. It'd be great if someone could clarify whether this is simply how it will remain, or whether new code needs to be added for Phi specifically:

Bumping. vllm/model_executor/models/llama.py relies on these assert statements:

assert self.total_num_heads % tp_size == 0
# ...
assert self.total_num_kv_heads % tp_size == 0
# or
assert tp_size % self.total_num_kv_heads == 0

I also believe serving requires the vocab size to be divisible by the tensor-parallel size, and possibly the hidden size and number of hidden layers as well. The issue is that for Phi-3-medium:

total_num_heads = 40
total_num_kv_heads = 10
vocab_size = 32064
hidden_size = 5120
num_hidden_layers = 40

Again, I'm not sure about the vocab-size constraint, but if it applies, only rigs with 1 or 2 GPUs would work (32064 is not divisible by 5, 10, 20, or 40). If vocab size doesn't matter, the head counts still limit us to 1, 2, 5, 10, 20, or 40 GPUs.
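The reasoning above can be checked mechanically. This is a hedged sketch, not vLLM code: valid_tp_sizes is a hypothetical helper that enumerates tensor-parallel sizes satisfying the head-count asserts for Phi-3-medium's config values, optionally also requiring the vocab size to divide evenly.

```python
# Phi-3-medium config values quoted in the comment above
total_num_heads = 40
total_num_kv_heads = 10
vocab_size = 32064

def valid_tp_sizes(require_vocab_divisible: bool = False) -> list[int]:
    """Enumerate tp sizes passing the llama.py head asserts (sketch only)."""
    sizes = []
    for tp in range(1, total_num_heads + 1):
        if total_num_heads % tp != 0:
            continue  # query heads wouldn't shard evenly
        if not (total_num_kv_heads % tp == 0 or tp % total_num_kv_heads == 0):
            continue  # KV heads neither shard nor replicate evenly
        if require_vocab_divisible and vocab_size % tp != 0:
            continue  # hypothetical extra vocab constraint
        sizes.append(tp)
    return sizes

print(valid_tp_sizes())      # [1, 2, 5, 10, 20, 40]
print(valid_tp_sizes(True))  # [1, 2]
```

Under these assumptions, tp=4 and tp=8 are excluded purely by the KV-head check, which matches the AssertionError in the original report.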