vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Add support for prompt-lookup speculative decoding #2469

Closed wasertech closed 9 months ago

wasertech commented 9 months ago

So transformers has introduced support for n-gram-based (prompt-lookup) speculative decoding.

https://github.com/huggingface/transformers/pull/27979

It's as simple as passing prompt_lookup_num_tokens=10 to model.generate in newer versions of transformers.
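For reference, a minimal sketch of what that looks like (the checkpoint and prompt here are just placeholders, not from the original report):

```python
# Minimal sketch of prompt-lookup decoding with transformers.
# "gpt2" and the prompt are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

# prompt_lookup_num_tokens enables n-gram speculation: candidate
# continuations are copied from earlier occurrences in the prompt
# instead of being proposed by a separate draft model.
outputs = model.generate(**inputs, prompt_lookup_num_tokens=10, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```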

Why would this be useful?

When the output reuses chunks of the prompt (e.g. summarization, code editing, RAG), it can speed up inference by up to 3x!

I haven't looked into it yet, but I think it wouldn't be too complicated to add a parameter to vLLM so that we can use this kind of speculative decoding. At the very least, the speedup would make the trouble worthwhile.
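For intuition, here's a rough, self-contained sketch of the prompt-lookup idea (illustrative only, not transformers' actual implementation): the trailing tokens of the sequence are matched against earlier occurrences in the prompt, and whatever followed the match is proposed as draft tokens.

```python
# Rough sketch of prompt lookup (illustrative only, not the real code).
def find_candidate_tokens(input_ids, max_ngram=3, num_pred=10):
    """Propose draft tokens by matching the trailing n-gram in the prompt."""
    for n in range(max_ngram, 0, -1):  # prefer the longest match
        ngram = input_ids[-n:]
        # scan backwards, skipping the trivial match with the suffix itself
        for i in range(len(input_ids) - n - 1, -1, -1):
            if input_ids[i:i + n] == ngram:
                start = i + n
                return input_ids[start:start + num_pred]
    return []  # no match: fall back to normal decoding

ids = [5, 9, 2, 7, 1, 9, 2]  # toy token ids; the suffix (9, 2) recurs earlier
print(find_candidate_tokens(ids, max_ngram=2, num_pred=3))  # -> [7, 1, 9]
```

The proposed tokens are then verified in a single forward pass of the target model, so wrong guesses cost little while correct ones save whole decode steps.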

Let me know what you think.

simon-mo commented 9 months ago

@cadedaniel is in charge of adding overall support for speculative decoding here: https://github.com/vllm-project/vllm/pull/2188. I would imagine that after this PR, n-gram support should be very straightforward.

wasertech commented 9 months ago

@simon-mo Thanks for letting me know!

wasertech commented 9 months ago

> I haven't looked into it yet, but I think it wouldn't be too complicated to add a parameter to vLLM so that we can use this kind of speculative decoding.

As always, it's a bit more complicated than I initially anticipated, but I'm glad to see it's in the works.

I'll close this issue even though it's not there yet, as the community already knows about it and is well on its way to achieving it: speculative decoding w/ vLLM. 🎉

cadedaniel commented 9 months ago

thanks for bringing this up @wasertech! we have an internal prototype for exactly this and it shows good results, but it's blocked on https://github.com/vllm-project/vllm/pull/2188 at the moment

wasertech commented 9 months ago

Looking forward to testing it on my hardware. I'm training atm, but I'll give your branch a try later, @cadedaniel. Thanks for your amazing contribution! 🚀

wasertech commented 9 months ago

You know what, let's keep this issue open so that other people who are wondering also know what's up. I (or someone with the right permissions) can close it once #2188 (and the PR that uses it to introduce n-gram speculation) are merged ^^

cadedaniel commented 9 months ago

Closing as duplicate, see https://github.com/vllm-project/vllm/issues/1802