@cadedaniel is in charge of adding overall support for speculative decoding here: https://github.com/vllm-project/vllm/pull/2188. I would imagine that after this PR, ngram support should be very straightforward.
@simon-mo Thanks for letting me know!
> I have not looked it up yet but I think it wouldn't be too complicated to add a parameter to vLLM so that we can use speculative decoding w/ vLLM.

As always, it's a bit more complicated than I initially anticipated, but I am glad to see it's in the works.
I'll close this issue even though it's not there yet, as the community already knows about it and is well on its way to achieving it: speculative decoding w/ vLLM. 🎉
Thanks for bringing this up @wasertech! We have an internal prototype for exactly this and it shows good results, but it's blocked on https://github.com/vllm-project/vllm/pull/2188 at the moment.
Looking forward to testing it on my hardware. I am training atm, but I will give your branch a try later, @cadedaniel. Thanks for your amazing contribution 🚀!
You know what, let's keep this issue open so that people who are also wondering know what's up. I (or someone with the right permissions) can close it once #2188 (and the PR that uses it to introduce ngram speculation) are merged ^^
Closing as duplicate, see https://github.com/vllm-project/vllm/issues/1802
So transformers has introduced support for ngram speculative decoding (prompt lookup decoding).
https://github.com/huggingface/transformers/pull/27979
It's as simple as passing `prompt_lookup_num_tokens=10` to `model.generate` in newer versions of transformers; see the sketch below.

Why would this be useful?
In many cases it can speed up inference by up to 3x!
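
For reference, here is a minimal sketch of what that call looks like on the transformers side (the model name and prompt are placeholders, not taken from the linked PR):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder; any causal LM should work
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Summarize the following document: ..."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# prompt_lookup_num_tokens enables ngram speculation from the prompt itself,
# so no separate draft model is needed.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    prompt_lookup_num_tokens=10,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```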
I have not looked into the implementation yet, but I think it wouldn't be too complicated to add a parameter to vLLM so that we can use this kind of speculative decoding w/ vLLM. At the very least, the speed-up should make the effort worthwhile.
Let me know what you think.
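
To make the suggestion concrete, here is a purely hypothetical sketch of what such a parameter could look like on the vLLM side; the `prompt_lookup_num_tokens` argument to `LLM(...)` is invented for illustration and is not an existing vLLM parameter:

```python
# Hypothetical sketch only: the prompt_lookup_num_tokens argument below is
# invented for illustration and is NOT an existing vLLM parameter.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # placeholder model
    prompt_lookup_num_tokens=10,        # hypothetical knob mirroring the transformers kwarg
)

outputs = llm.generate(
    ["Summarize the following document: ..."],  # placeholder prompt
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```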