vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature Request] Adding Eagle, Medusa, Look Ahead decoding (improvements of Speculative decoding) #2791

Open HamidShojanazeri opened 7 months ago

HamidShojanazeri commented 7 months ago

Thanks for the great work, team. I wonder if there is any plan to add newer improvements to speculative decoding, such as EAGLE, Medusa, and Lookahead decoding. These could result in cumulative speedups for vLLM.

cc: @WoosukKwon

simon-mo commented 7 months ago

Yes. The plan is here #2188

HamidShojanazeri commented 7 months ago

Thanks for sharing, @simon-mo, that sounds great! I also wonder whether newer methods can improve on speculative decoding by removing the need for a draft model. Are we exploring that path as well?

simon-mo commented 7 months ago

The speculative decoding framework is designed to support a wide range of draft-model and draft-model-free algorithms. Once the immediate features are in place (by @cadedaniel), we welcome community contributions for more methods!

cadedaniel commented 7 months ago

Correct! And yes, speculation methods without a draft model have benefits in both performance and usability. Unclear right now which specific approach will end up being the best but vLLM should support it.
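To make "speculation without a draft model" concrete, here is a minimal sketch of the n-gram lookup idea behind prompt-lookup-style methods: draft tokens are copied from an earlier occurrence of the current suffix, then checked against the target model. This is not vLLM's implementation. The names `propose_from_prompt`, `verify_greedy`, and `toy_target_next` are hypothetical, the target model is a toy stand-in, and greedy acceptance is assumed.

```python
from typing import Callable, List, Sequence


def propose_from_prompt(tokens: Sequence[int], ngram: int = 2, k: int = 3) -> List[int]:
    """Propose up to k draft tokens without a draft model by finding the most
    recent earlier occurrence of the last `ngram` tokens and copying what followed."""
    if len(tokens) < ngram:
        return []
    suffix = list(tokens[-ngram:])
    # Scan right to left, skipping the suffix's own position at the end.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == suffix:
            follow = start + ngram
            return list(tokens[follow:follow + k])
    return []


def verify_greedy(
    tokens: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Accept the longest draft prefix that matches the target's greedy choice,
    then append one token from the target itself.

    A real engine would score all draft positions in a single forward pass;
    the sequential calls here are only for readability."""
    accepted = list(tokens)
    for tok in draft:
        if target_next(accepted) != tok:
            break
        accepted.append(tok)
    accepted.append(target_next(accepted))
    return accepted[len(tokens):]


def toy_target_next(seq: Sequence[int]) -> int:
    """Hypothetical stand-in for the target model: continues a repeating 1,2,3,4 pattern."""
    return [1, 2, 3, 4][len(seq) % 4]


context = [1, 2, 3, 4, 1, 2]
draft = propose_from_prompt(context)
new_tokens = verify_greedy(context, draft, toy_target_next)
```

In this toy run all three draft tokens are accepted and the target contributes one more, so a single verification step yields four tokens. When the draft diverges early, the step still yields at least one token from the target, so the output matches plain greedy decoding.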

caliber1313 commented 7 months ago

I would like to suggest adding Hydra to the project alongside Medusa. The Hydra repository is here: https://github.com/zankner/Hydra.

Thank you for your consideration

josephrocca commented 3 months ago

For those interested in some ranking data for the different methods, below is a copy-paste from a neat project by @hemingkx called Spec-Bench. The ranking when running 33B models is similar. Please see the linked repo for the latest data; I'm just pasting it here for those who are skimming this thread.

| Models | Multi-turn Conversation | Translation | Summarization | Question Answering | Mathematical Reasoning | Retrieval-aug. Generation | #Mean Accepted Tokens | Overall |
|---|---|---|---|---|---|---|---|---|
| EAGLE🏅 | 2.44x | 1.81x | 2.13x | 2.11x | 2.54x | 1.82x | 3.57 | 2.16x |
| SpS🥈 | 1.98x | 1.37x | 2.00x | 1.95x | 1.89x | 1.76x | 2.29 | 1.83x |
| Hydra🥉 | 2.04x | 1.67x | 1.56x | 1.81x | 2.16x | 1.48x | 3.26 | 1.80x |
| PLD | 1.57x | 1.07x | 2.31x | 1.25x | 1.62x | 1.56x | 1.74 | 1.55x |
| Medusa | 1.60x | 1.38x | 1.28x | 1.46x | 1.64x | 1.22x | 2.32 | 1.44x |
| REST | 1.49x | 1.18x | 1.21x | 1.46x | 1.35x | 1.27x | 1.63 | 1.32x |
| Lookahead | 1.13x | 0.97x | 1.05x | 1.07x | 1.29x | 0.98x | 1.65 | 1.08x |