Closed: ShangmingCai closed this issue 1 month ago.
Found a way to work around.
How did you work around this? Facing the same issue now with llama3 and tinyllama
You can modify the model file (e.g., vllm/model_executor/models/your_tinyllama_model.py) to concatenate embed_tokens.weight with a padding tensor, then change the vocab_size parameter in your_tinyllama_path/config.json to match your llama3 model as a workaround.
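For anyone looking for a concrete starting point, here is a minimal sketch of the embedding-padding step described above. The function name, shapes, and vocab sizes are illustrative assumptions, not actual vLLM code:

```python
import torch

# Hypothetical helper: pad a draft model's embedding matrix along the vocab
# dimension so it matches the target model's vocab size.
def pad_embedding_to_vocab(embed_weight: torch.Tensor, target_vocab_size: int) -> torch.Tensor:
    """embed_weight: [draft_vocab_size, hidden_dim] -> [target_vocab_size, hidden_dim]."""
    draft_vocab_size, hidden_dim = embed_weight.shape
    if draft_vocab_size >= target_vocab_size:
        return embed_weight
    padding = torch.zeros(
        target_vocab_size - draft_vocab_size,
        hidden_dim,
        dtype=embed_weight.dtype,
        device=embed_weight.device,
    )
    return torch.cat([embed_weight, padding], dim=0)

# Example (sizes assumed): pad TinyLlama's 32000-token embedding to Llama 3's
# 128256-token vocab, then set vocab_size accordingly in the draft model's config.
# model.embed_tokens.weight.data = pad_embedding_to_vocab(model.embed_tokens.weight.data, 128256)
```

Depending on the model, the lm_head weight may need the same padding so the output logits also match the new vocab size.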
Found a way to work around.
Have you solved the problem of inconsistent vocab_size? If you solved it, can you show me how you did it?
🚀 The feature, motivation and pitch
Currently, vLLM with speculative decoding requires the draft model and the target model to have the same vocab size. However, the target model may have more parameters and be trained on more data, so its vocab size may also be slightly larger (for instance, when using a 0.5B model as the draft model and a 72B model as the target model). In spec_decode_worker.py, line 564:
assert all(vocab_sizes[0] == vocab_size for vocab_size in vocab_sizes)
makes the inference pipeline unable to run when the vocab sizes differ. If this line is simply removed, the program may encounter a reshape error. So, is it possible to support inconsistent vocab sizes, for example by padding the draft model's vocab size to match the target model's?
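For illustration, here is a rough sketch of what such padding could look like at the logits level; this is a hypothetical helper to show the idea, not the vLLM implementation:

```python
import torch
import torch.nn.functional as F

# Hypothetical helper: pad draft-model logits along the vocab dimension so the
# shapes match the target model's vocab size before scoring/rejection sampling.
def pad_draft_logits(draft_logits: torch.Tensor, target_vocab_size: int) -> torch.Tensor:
    """draft_logits: [..., draft_vocab_size] -> [..., target_vocab_size]."""
    draft_vocab_size = draft_logits.shape[-1]
    if draft_vocab_size >= target_vocab_size:
        return draft_logits
    pad_width = target_vocab_size - draft_vocab_size
    # -inf on the padded ids gives them zero probability after softmax, so the
    # extra target-model tokens can never be proposed by the draft model.
    return F.pad(draft_logits, (0, pad_width), value=float("-inf"))
```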
Alternatives
No response
Additional context
No response