predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
https://loraexchange.ai
Apache License 2.0

Add support for ngram speculation with `--speculative-tokens` param #259

Closed abhibst closed 5 months ago

abhibst commented 6 months ago

System Info

using latest docker image

Reproduction

similar to this param with TGI https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher#speculate

Expected behavior

Expecting the server to start and run without error when passing the `--speculate 3` param.

In our experiments with TGI, this roughly doubled TPS (tokens per second) with Mixtral 8x7B.
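For context, a launch might look something like the following. This is a sketch only: the flag name is taken from the issue title, and the image tag and model ID are illustrative, not confirmed from LoRAX docs.

```shell
# Hypothetical invocation; flag name from the issue title,
# image tag, model ID, and ports are illustrative.
docker run --gpus all -p 8080:80 ghcr.io/predibase/lorax:latest \
    --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --speculative-tokens 3
```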

jeffreyftang commented 6 months ago

Hi @abhibst, we're in the process of adding speculative decoding using Medusa and other adapter-based methods. We're open to exploring n-gram speculation as well, but based on some of the benchmarks we've seen, it may not actually help especially if you're already compute-bound. Definitely something we're open to investigating after landing the adapter-based work.
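For readers unfamiliar with the technique: n-gram speculation (also called prompt-lookup decoding) proposes draft tokens by matching the last few generated tokens against earlier occurrences in the sequence and copying what followed, then verifying the drafts with the main model in one forward pass. A minimal sketch of the lookup step (names are illustrative, not LoRAX internals):

```python
# Sketch of the draft-proposal step in n-gram speculation: find the most
# recent earlier occurrence of the trailing n-gram and copy the tokens
# that followed it as speculative drafts.

def ngram_speculate(tokens, ngram_size=2, num_draft=3):
    """Return up to `num_draft` draft tokens guessed from the context."""
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Search backwards for a prior occurrence of the trailing n-gram,
    # excluding the trailing occurrence itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = tokens[start + ngram_size:start + ngram_size + num_draft]
            if follow:
                return follow
    return []

# Repeated phrasing lets the lookup succeed:
seq = ["the", "cat", "sat", "on", "the", "cat"]
print(ngram_speculate(seq, ngram_size=2, num_draft=3))  # ['sat', 'on', 'the']
```

This illustrates why the speedup is workload-dependent (and why it may not help when compute-bound): drafts are only proposed when the context repeats itself, which is common in code and extraction tasks but rare in open-ended generation.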

abhibst commented 6 months ago

Hi team, any update on this?

tgaddair commented 5 months ago

Hey @abhibst, PR #372 adds support for Medusa. ngram speculation will come right after as a follow-up.

tgaddair commented 5 months ago

Hey @abhibst, PR #375 adds support for ngram speculation. Should be able to land this today!

abhibst commented 5 months ago

Thanks @tgaddair