kevinhu opened 7 months ago
https://github.com/vllm-project/vllm/pull/2188 introduces a framework for verifying proposal tokens. Once it's merged, PLD is not very difficult to add.
Hello @cadedaniel, thank you and the vLLM team for creating such a great library. I saw that vLLM supports speculative decoding, but I couldn't find any documentation on how to use it, nor on prompt lookup decoding. Could you give a simple example of how to use this feature?
Prompt lookup decoding (PLD) is a variant of speculative decoding that replaces the draft model with a prefix lookup in the current sequence, resulting in a 2-4x throughput boost for input-grounded tasks like summarization and code modification.
Because PLD doesn't require a secondary draft model, it might be easier to implement in vLLM?
See https://github.com/apoorvumang/prompt-lookup-decoding for details.