vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Feature request: prompt lookup decoding #1802

Open kevinhu opened 7 months ago

kevinhu commented 7 months ago

Prompt lookup decoding (PLD) is a variant of speculative decoding that replaces the draft model with a prefix lookup in the current sequence, resulting in a 2-4x throughput boost for input-grounded tasks like summarization and code modification.

Because PLD doesn't require a secondary model, it might be easier to implement in vLLM?

See https://github.com/apoorvumang/prompt-lookup-decoding for details.
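For readers unfamiliar with the technique, here is a minimal, illustrative sketch of the core lookup step (not vLLM code; the function name and defaults are invented for the example): the last few tokens of the sequence are matched against earlier positions, and the tokens that followed the match are proposed as draft tokens for the target model to verify in a single forward pass.

```python
# Minimal sketch of prompt lookup decoding's proposal step (assumed names/defaults).
# Instead of running a draft model, search the existing token sequence for a
# recent occurrence of the trailing n-gram and propose the tokens that followed it.

def propose_draft_tokens(token_ids, max_ngram=3, num_draft_tokens=5):
    """Return candidate continuation tokens found by prefix lookup, or []."""
    for n in range(max_ngram, 0, -1):
        if len(token_ids) < n + 1:
            continue
        pattern = token_ids[-n:]
        # Scan earlier positions for the same n-gram, most recent match first
        # (the trailing n-gram itself is excluded from the scan).
        for start in range(len(token_ids) - n - 1, -1, -1):
            if token_ids[start:start + n] == pattern:
                follow = token_ids[start + n:start + n + num_draft_tokens]
                if follow:
                    return follow
        # No match at this length; fall back to a shorter n-gram.
    return []

# Example: the bigram (7, 8) occurred earlier, so the tokens after it are proposed.
print(propose_draft_tokens([5, 7, 8, 9, 10, 11, 7, 8], num_draft_tokens=3))  # -> [9, 10, 11]
```

The proposed tokens are then verified by the target model exactly as in ordinary speculative decoding; any mismatching suffix is discarded, so correctness is unchanged and the win comes from accepting multiple tokens per forward pass on repetitive, input-grounded text.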

cadedaniel commented 5 months ago

https://github.com/vllm-project/vllm/pull/2188 introduces a framework for verifying proposal tokens. Once it's merged, PLD will not be very difficult to add.

trinhdoduyhungss commented 2 months ago

Hello @cadedaniel, thank you and the vLLM team for creating a great library. I saw that vLLM now supports speculative decoding, but I couldn't find any documentation on how to use this feature, nor on prompt lookup decoding. Can you give me a simple example of how to use it?

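For reference, a hedged sketch of how ngram-based (prompt lookup) speculative decoding was later exposed through the offline `LLM` API. The specific argument names (`speculative_model="[ngram]"`, `num_speculative_tokens`, `ngram_prompt_lookup_max`, `use_v2_block_manager`) are assumptions tied to a particular vLLM release and may differ in current versions; check the documentation at https://docs.vllm.ai before relying on them.

```python
# Hedged sketch, assuming a vLLM release where ngram speculative decoding is
# configured via these engine arguments; names and defaults may have changed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-6.7b",
    speculative_model="[ngram]",   # draft tokens come from prompt lookup, not a draft model
    num_speculative_tokens=5,      # number of tokens proposed per step
    ngram_prompt_lookup_max=4,     # longest n-gram to match against the prompt
    use_v2_block_manager=True,     # required for speculative decoding in that release
)

outputs = llm.generate(
    ["Summarize the following text: ..."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```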