[Feature Request] Adding Eagle, Medusa, Look Ahead decoding ( improvements of Speculative decoding)

HamidShojanazeri commented 7 months ago

Thanks for the great work, team. I wonder whether there is any plan to add newer improvements to speculative decoding, such as EAGLE, Medusa, and lookahead decoding. These could yield cumulative speedups for vLLM.

cc: @WoosukKwon

simon-mo commented 7 months ago

Yes. The plan is here: #2188

HamidShojanazeri commented 7 months ago

Thanks for sharing, @simon-mo, that sounds great! Some newer methods also improve on speculative decoding by removing the need for a draft model. Is that path being explored as well?

simon-mo commented 7 months ago

The speculative decoding framework is designed to support a wide range of algorithms, both draft-model-based and draft-model-free. Once the immediate features are in place (by @cadedaniel), we welcome community contributions for more methods!

cadedaniel commented 7 months ago

Correct! And yes, speculation methods without a draft model have benefits in both performance and usability. It is unclear right now which specific approach will end up being the best, but vLLM should support it.
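For readers new to the idea, here is a minimal toy sketch of draft-model-free speculation in the spirit of prompt lookup decoding (PLD). It is not vLLM's implementation: `propose_from_context`, `speculative_greedy`, and the `target_next` callback are hypothetical names made up for illustration, and the "target model" is just a Python function. The proposer copies tokens that followed an earlier occurrence of the current n-gram, and the target then accepts the longest matching prefix, so the output is identical to plain greedy decoding.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
# Hypothetical stand-in for the target model: maps a context to its greedy next token.
NextTokenFn = Callable[[Sequence[Token]], Token]


def propose_from_context(tokens: Sequence[Token], ngram: int, k: int) -> List[Token]:
    """Propose up to k tokens by matching the last `ngram` tokens earlier in the context."""
    if len(tokens) < ngram:
        return []
    suffix = list(tokens[-ngram:])
    # Scan right to left for the most recent earlier occurrence of the suffix.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == suffix:
            follow = start + ngram
            return list(tokens[follow:follow + k])
    return []


def greedy(target_next: NextTokenFn, prompt: Sequence[Token], n: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens


def speculative_greedy(
    target_next: NextTokenFn,
    prompt: Sequence[Token],
    n: int,
    ngram: int = 2,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy decoding with n-gram proposals; returns (tokens, verification passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < n:
        draft = propose_from_context(tokens, ngram, k)
        # Assumption: a real engine would score every draft position in ONE batched
        # forward pass. This toy calls the target sequentially but counts one pass.
        passes += 1
        accepted: List[Token] = []
        for d in draft:
            t = target_next(tokens + accepted)
            if t != d:
                accepted.append(t)  # first mismatch: keep the target's token and stop
                break
            accepted.append(d)
        else:
            # Every draft token matched (or the draft was empty): take one more target token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + n], passes


if __name__ == "__main__":
    def toy_model(ctx: Sequence[Token]) -> Token:
        return (ctx[-1] + 1) % 5  # toy target that simply counts modulo 5

    prompt = [0, 1, 2, 3, 4, 0, 1]
    out, passes = speculative_greedy(toy_model, prompt, 10)
    print(out == greedy(toy_model, prompt, 10), passes)
```

On repetitive text the proposer is often right, so several tokens are committed per verification pass. On text with no repeated n-grams it proposes nothing and the loop degrades to ordinary one-token-per-pass decoding.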

caliber1313 commented 7 months ago

I would like to suggest adding Hydra to your project alongside Medusa. The Hydra repository is here: https://github.com/zankner/Hydra.

Thank you for your consideration.

josephrocca commented 3 months ago

For those interested in some ranking data for the different methods, below is a copy-paste from a neat project by @hemingkx called Spec-Bench. The ranking when running 33B models is similar. Please see the linked repo for the latest data; I'm just pasting it here for those who are skimming this thread.

Device: a single NVIDIA GeForce RTX 3090 GPU (24GB) with 12 CPU cores
Testing environment: PyTorch 2.0.1, under CUDA 11.8
Experimental Settings: Vicuna-7B-v1.3, greedy decoding, FP16 precision, batch size = 1

Models	Multi-turn Conversation	Translation	Summarization	Question Answering	Mathematical Reasoning	Retrieval-augmented Generation	#Mean Accepted Tokens	Overall
EAGLE🏅	2.44x	1.81x	2.13x	2.11x	2.54x	1.82x	3.57	2.16x
SpS🥈	1.98x	1.37x	2.00x	1.95x	1.89x	1.76x	2.29	1.83x
Hydra🥉	2.04x	1.67x	1.56x	1.81x	2.16x	1.48x	3.26	1.80x
PLD	1.57x	1.07x	2.31x	1.25x	1.62x	1.56x	1.74	1.55x
Medusa	1.60x	1.38x	1.28x	1.46x	1.64x	1.22x	2.32	1.44x
REST	1.49x	1.18x	1.21x	1.46x	1.35x	1.27x	1.63	1.32x
Lookahead	1.13x	0.97x	1.05x	1.07x	1.29x	0.98x	1.65	1.08x
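One thing worth noticing in these numbers is that mean accepted tokens and overall speedup do not rank the methods identically. The short sketch below only re-sorts values copied from the table above. The variable names are mine, and the closing interpretation is an assumption, not something Spec-Bench states here.

```python
# (method, mean accepted tokens, overall speedup), copied from the table above.
rows = [
    ("EAGLE", 3.57, 2.16),
    ("SpS", 2.29, 1.83),
    ("Hydra", 3.26, 1.80),
    ("PLD", 1.74, 1.55),
    ("Medusa", 2.32, 1.44),
    ("REST", 1.63, 1.32),
    ("Lookahead", 1.65, 1.08),
]

# Rank once by overall speedup and once by mean accepted tokens.
by_speedup = [name for name, _, _ in sorted(rows, key=lambda r: r[2], reverse=True)]
by_accepted = [name for name, _, _ in sorted(rows, key=lambda r: r[1], reverse=True)]

print("by overall speedup:     ", by_speedup)
print("by mean accepted tokens:", by_accepted)
```

For example, Medusa accepts slightly more tokens per step than SpS (2.32 vs 2.29) but has a lower overall speedup (1.44x vs 1.83x). A plausible reading is that per-step drafting and verification costs differ between methods, so acceptance length alone does not determine the speedup.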

vllm-project / vllm

[Feature Request] Adding Eagle, Medusa, Look Ahead decoding ( improvements of Speculative decoding) #2791