Closed: fzp0424 closed this issue 2 weeks ago
me too
Bumping. vllm/model_executor/models/llama.py relies on these assert statements:
assert self.total_num_heads % tp_size == 0
# ...
assert self.total_num_kv_heads % tp_size == 0
# or
assert tp_size % self.total_num_kv_heads == 0
And I also believe serving requires vocab size to be divisible by tp? And hidden size? And hidden layers? The issue is that
total_num_heads = 40
total_num_kv_heads = 10
vocab_size = 32064
hidden_size = 5120
num_hidden_layers = 40
Again, not sure about the vocab size, but if that is the case, only rigs with 2 GPUs would work. If vocab size doesn't matter, we'd have to have either 2, 10, 20, or 40 GPUs.
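For what it's worth, here is a small standalone sketch (not vLLM code) that applies the assertions quoted above, plus the unconfirmed vocab-size constraint, to the Phi-3-medium numbers, so you can see which tensor-parallel sizes pass:

```python
# Standalone sketch: apply the divisibility checks quoted above from
# vllm/model_executor/models/llama.py, plus the hypothesized (unconfirmed)
# vocab-size constraint, to the Phi-3-medium config values in this thread.
total_num_heads = 40
total_num_kv_heads = 10
vocab_size = 32064

def tp_size_ok(tp_size: int, require_vocab_divisible: bool = False) -> bool:
    # Mirrors: assert self.total_num_heads % tp_size == 0
    if total_num_heads % tp_size != 0:
        return False
    # Mirrors: assert total_num_kv_heads % tp_size == 0
    #          or    tp_size % total_num_kv_heads == 0
    if total_num_kv_heads % tp_size != 0 and tp_size % total_num_kv_heads != 0:
        return False
    # Hypothesized extra constraint discussed above; may not apply in practice.
    if require_vocab_divisible and vocab_size % tp_size != 0:
        return False
    return True

for tp in (1, 2, 4, 5, 8, 10, 16, 20, 40):
    print(tp, tp_size_ok(tp), tp_size_ok(tp, require_vocab_divisible=True))
```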
#5500 also talks about this in the Q&A section. I'll paste in what I responded with. It'd be great if someone could clear up whether this is just how it'll be or whether there needs to be some updated/added/new code for Phi specifically:
Bumping. vllm/model_executor/models/llama.py relies on these assert statements:
assert self.total_num_heads % tp_size == 0
# ...
assert self.total_num_kv_heads % tp_size == 0
# or
assert tp_size % self.total_num_kv_heads == 0
And I also believe serving requires vocab size to be divisible by tp? And hidden size? And hidden layers? The issue is that
total_num_heads = 40
total_num_kv_heads = 10
vocab_size = 32064
hidden_size = 5120
num_hidden_layers = 40
Again, not sure about the vocab size, but if that is the case, only rigs with 2 GPUs would work. If vocab size doesn't matter, we'd have to have either 2, 10, 20, or 40 GPUs.
Sorry for the long delay! Yes, that's just how TP works: the number of attention heads must be divisible by the number of GPUs used.
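To illustrate the answer above, a minimal offline-inference sketch using a tensor-parallel size that satisfies the head-count constraint; the checkpoint name is only an example of a Phi-3-medium model:

```python
from vllm import LLM, SamplingParams

# Sketch: Phi-3-medium has 40 attention heads and 10 KV heads, so
# tensor_parallel_size=2 satisfies the checks quoted above while 4 does not.
# The checkpoint name is an example; trust_remote_code may or may not be
# needed depending on your transformers version.
llm = LLM(
    model="microsoft/Phi-3-medium-128k-instruct",
    tensor_parallel_size=2,
    trust_remote_code=True,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```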
Your current environment
When I set VLLM_TENSOR_PARALLEL_SIZE = 2, it works well. But when I change it to 4, vLLM cannot support Phi-3-medium-*. I also see the same problem in other issues and in discussion https://github.com/vllm-project/vllm/discussions/5500
🐛 Describe the bug
Models and parameters
Error information