Closed — abhibst closed this issue 5 months ago
Hi @abhibst, we're in the process of adding speculative decoding using Medusa and other adapter-based methods. We're open to exploring n-gram speculation as well, but based on some of the benchmarks we've seen, it may not actually help especially if you're already compute-bound. Definitely something we're open to investigating after landing the adapter-based work.
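For context, n-gram speculation (often called prompt-lookup decoding) drafts candidate tokens by matching the most recent n-gram against earlier occurrences in the context, then letting the target model verify the drafts in a single forward pass. A minimal sketch of the drafting step — function and parameter names here are illustrative, not the actual implementation:

```python
def ngram_draft(tokens, ngram_size=3, num_draft=3):
    """Propose draft tokens by matching the trailing n-gram earlier in context.

    If the last `ngram_size` tokens appeared before, the tokens that followed
    that earlier occurrence become the speculative draft. The target model
    would then verify these drafts in one forward pass (not shown).
    """
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards for an earlier occurrence of the trailing n-gram
    # (excluding the tail itself at index len(tokens) - ngram_size).
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            cont = tokens[start + ngram_size:start + ngram_size + num_draft]
            if cont:
                return cont
    return []  # no match: fall back to normal one-token-at-a-time decoding
```

This is why it helps most on repetitive or retrieval-heavy inputs and can be a wash when the server is already compute-bound: drafting is free, but verification still costs a full forward pass over the draft tokens.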
Hi team, any update on this?
Hey @abhibst, PR #372 adds support for Medusa. n-gram speculation will follow immediately after.
Hey @abhibst, PR #375 adds support for ngram speculation. Should be able to land this today!
Thanks @tgaddair
System Info
Using the latest Docker image.
Reproduction
Requesting a param similar to TGI's `--speculate` option: https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher#speculate
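For reference, the TGI invocation we compared against looked roughly like this (the model id is a placeholder; `--speculate` is the flag from the linked launcher docs):

```shell
# Launch TGI with n-gram speculation drafting up to 3 tokens per step.
text-generation-launcher \
  --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --speculate 3
```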
Expected behavior
Expecting to run without errors when using the `--speculate 3` param.
In our experiments with TGI, it increased TPS by roughly 2x with Mixtral 8x7B.