vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: inconsistent vocab_sizes support for draft and target workers while using Speculative Decoding #5203

Closed. ShangmingCai closed this issue 1 month ago.

ShangmingCai commented 1 month ago

🚀 The feature, motivation and pitch

Currently, vLLM with Speculative Decoding requires the draft model and the target model to have the same vocab size. However, the target model may be much larger and trained on more data, so its vocab size can also be slightly larger (for example, when using a 0.5B model as the draft model and a 72B model as the target model). In spec_decode_worker.py, line 564 asserts `all(vocab_sizes[0] == vocab_size for vocab_size in vocab_sizes)`, which prevents the inference pipeline from running. If this assertion is removed, the program may hit a reshape error instead. So, is support for inconsistent vocab sizes feasible, e.g., by padding the draft model's vocabulary up to the target's?
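
For illustration, here is a minimal sketch (not the vLLM implementation; the function name and shapes are assumptions) of what padding on the logits side could look like: the draft model's logits are extended to the target vocab size with `-inf`, so the padded token ids get zero probability and downstream sampling sees matching shapes.

```python
import torch

# Hypothetical sketch, not vLLM code: pad draft-model logits up to the target
# model's vocab size so both workers produce tensors of the same shape.
def pad_draft_logits(draft_logits: torch.Tensor, target_vocab_size: int) -> torch.Tensor:
    draft_vocab_size = draft_logits.shape[-1]
    if draft_vocab_size >= target_vocab_size:
        return draft_logits
    pad = draft_logits.new_full(
        (*draft_logits.shape[:-1], target_vocab_size - draft_vocab_size),
        float("-inf"),  # padded token ids get zero probability after softmax
    )
    return torch.cat([draft_logits, pad], dim=-1)
```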

Alternatives

No response

Additional context

No response

ShangmingCai commented 1 month ago

Found a way to work around it.

devdev999 commented 1 month ago

How did you work around this? I'm facing the same issue with Llama 3 and TinyLlama.

ShangmingCai commented 1 month ago

> How did you work around this? I'm facing the same issue with Llama 3 and TinyLlama.

You can modify the draft model file (i.e., vllm/model_executor/models/your_tinyllama_model.py) to concatenate embed_tokens.weight with a padding tensor. Then, to work around the issue, change the vocab_size param in your_tinyllama_path/config.py to match your Llama 3 model.
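
A rough sketch of the weight-padding step described above, assuming an embed_tokens.weight of shape [draft_vocab_size, hidden_size] (where exactly to hook this in, e.g. the draft model's weight-loading code, depends on your vLLM version):

```python
import torch

# Sketch of the workaround: append zero rows to the draft model's embedding
# matrix so its vocab dimension matches the target model's vocab size.
def pad_embedding_weight(weight: torch.Tensor, target_vocab_size: int) -> torch.Tensor:
    draft_vocab_size, hidden_size = weight.shape
    if draft_vocab_size >= target_vocab_size:
        return weight
    padding = torch.zeros(
        target_vocab_size - draft_vocab_size,
        hidden_size,
        dtype=weight.dtype,
        device=weight.device,
    )
    # Padded rows are never indexed by real token ids, so zeros are a safe filler.
    return torch.cat([weight, padding], dim=0)
```

After padding the weight, also bump vocab_size in the draft model's config to the same value so the shapes stay consistent end to end.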

AU-CPU commented 9 hours ago

> Found a way to work around it.

Have you solved the problem of inconsistent vocab_size? If so, could you show me how you did it?