ShangmingCai closed this issue 5 months ago.
Found a way to work around.
How did you work around this? Facing the same issue now with llama3 and tinyllama
You can modify the model file (i.e., vllm/model_executor/models/your_tinyllama_model.py) to concatenate embed_tokens.weight with a padding tensor, then change the vocab_size param in your_tinyllama_path/config.py to align with your llama3 model as a workaround.
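For reference, a minimal sketch of that kind of embedding padding, assuming the only mismatch is the vocab dimension and that the hidden size already matches; the sizes and variable names below are illustrative placeholders, not vLLM internals:

```python
import torch

# Placeholder sizes: pad the draft model's embedding rows up to the target
# model's vocab size. Substitute the real vocab sizes of your two models.
draft_vocab, target_vocab, hidden = 32000, 32064, 2048

embed_tokens_weight = torch.randn(draft_vocab, hidden)  # stands in for the loaded weight
padding = torch.zeros(target_vocab - draft_vocab, hidden,
                      dtype=embed_tokens_weight.dtype,
                      device=embed_tokens_weight.device)
embed_tokens_weight = torch.cat([embed_tokens_weight, padding], dim=0)
assert embed_tokens_weight.shape == (target_vocab, hidden)
```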
Have you solved the problem of inconsistent vocab_size? If you solved it, can you show me how you did it?
I have a simple work around.

my case:
- vllm version: 0.5.1
- draft model: Qwen1.5-0.5B-Chat, with vocab size 151936
- target model: Qwen1.5-14B-Chat, with vocab size 152064

method:
- update the `vocab_size` config of the draft model (Qwen1.5-0.5B-Chat/config.json) to 152064
- modify the following two lines in vllm/model_executor/layers/vocab_parallel_embedding.py to pad the vocab embedding tensor:

    # original
    assert loaded_weight.shape[output_dim] == self.org_vocab_size
    # modified
    assert loaded_weight.shape[output_dim] <= self.org_vocab_size  # from `==` to `<=`

    # original
    loaded_weight = loaded_weight.narrow(output_dim, start_idx, shard_size)
    # modified
    if shard_size <= loaded_weight.shape[0]:
        loaded_weight = loaded_weight.narrow(output_dim, start_idx, shard_size)

That's all.
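If you want to double-check the two configs before and after the `vocab_size` edit, a quick sketch using transformers (the Hub IDs match the models above; swap in local paths if needed):

```python
from transformers import AutoConfig

draft = AutoConfig.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")
target = AutoConfig.from_pretrained("Qwen/Qwen1.5-14B-Chat")
print("draft vocab_size:", draft.vocab_size)    # 151936 before the edit
print("target vocab_size:", target.vocab_size)  # 152064
```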
I tried the aforementioned method using Qwen2-7B-Instruct and Qwen2-0.5B-Instruct, and it works on a single GPU. However, on multiple GPUs, it shows the following error:
Try modifying line 400 of vllm/model_executor/models/qwen2.py from

    weight_loader(param, loaded_weight)

to

    if name == "model.embed_tokens.weight":
        # Pad the draft checkpoint's 151936 embedding rows up to 152064
        # (128 extra rows; the 1024 must match the embedding dim being padded).
        if loaded_weight.shape[0] == 151936:
            padding = torch.zeros((128, 1024),
                                  dtype=loaded_weight.dtype,
                                  device=loaded_weight.device)
            loaded_weight = torch.cat([loaded_weight, padding], dim=0)
        weight_loader(param, loaded_weight)
    else:
        weight_loader(param, loaded_weight)
You might actually have to do the padding on the weights themselves to pass the many tensor-shape assertion checks.
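For context on why the single-GPU edits are not enough under tensor parallelism (my reading of the behavior, not a description of vLLM internals): the vocab-parallel embedding splits the padded vocab across ranks, so each rank only loads a slice of the checkpoint tensor, and on the last rank that slice can run past the draft checkpoint's 151936 rows unless the weight itself has been padded first. A rough sketch of the arithmetic, assuming an even split purely for illustration:

```python
def shard_range(padded_vocab_size: int, tp_rank: int, tp_size: int):
    # Simplified even split per rank; vLLM's real sharding differs in details.
    shard_size = padded_vocab_size // tp_size
    start_idx = tp_rank * shard_size
    return start_idx, shard_size

padded_vocab, draft_vocab = 152064, 151936
for rank in range(2):  # tp_size = 2
    start, size = shard_range(padded_vocab, rank, tp_size=2)
    end = start + size
    status = "runs past the draft checkpoint" if end > draft_vocab else "ok"
    print(f"rank {rank}: rows [{start}, {end}) -> {status}")
```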
🚀 The feature, motivation and pitch
Currently, vLLM with speculative decoding requires that the draft model and the target model have the same vocab size. However, the target model may have a larger parameter count and be trained on more data, so its vocab size may also be slightly larger (for example, if we use a 0.5B model as the draft model and a 72B model as the target model). In spec_decode_worker.py, line 564:

assert all(vocab_sizes[0] == vocab_size for vocab_size in vocab_sizes)

makes the inference pipeline unable to run. If this line is removed, the program may encounter a reshape error. So is an implementation with inconsistent vocab sizes possible, such as padding the draft model's vocab size?

Alternatives
No response
Additional context
No response