Closed: yhyu13 closed this issue 9 months ago
Feel free to submit a PR. The most immediate obstacle for lookahead decoding is the lack of support in Flash Attention, which doesn't allow for custom attention masks.
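To illustrate (a toy sketch, not lookahead decoding's exact layout): a lookahead step packs several guess branches into one forward pass, and each branch token may only attend to the committed prefix and to earlier tokens in its own branch. That needs an arbitrary boolean mask like the one below, which PyTorch's `scaled_dot_product_attention` accepts but flash-attn's fused kernel (which only exposes a `causal` flag) does not.

```python
import torch
import torch.nn.functional as F

# Toy sketch: a lookahead-style attention mask for one decoding step.
# prefix_len committed tokens, followed by n_branches guess branches of
# branch_len tokens each, all packed into a single forward pass.
prefix_len, n_branches, branch_len = 4, 2, 3
total = prefix_len + n_branches * branch_len

mask = torch.zeros(total, total, dtype=torch.bool)
# Committed prefix attends causally to itself.
mask[:prefix_len, :prefix_len] = torch.tril(torch.ones(prefix_len, prefix_len, dtype=torch.bool))
for b in range(n_branches):
    start = prefix_len + b * branch_len
    # Every branch token sees the whole committed prefix...
    mask[start:start + branch_len, :prefix_len] = True
    # ...and earlier tokens in its own branch, but not the other branches.
    mask[start:start + branch_len, start:start + branch_len] = torch.tril(
        torch.ones(branch_len, branch_len, dtype=torch.bool)
    )

# PyTorch SDPA takes an arbitrary boolean mask; flash-attn's kernel only
# exposes a causal flag, so a layout like this can't be expressed with it.
q = k = v = torch.randn(1, 1, total, 8)  # (batch, heads, seq, head_dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # torch.Size([1, 1, 10, 8])
```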
Currently, ExLlamaV2 supports speculative decoding, which gives a similar speedup to LADE or Medusa. It also allows sampling and takes advantage of flash-attn.
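For context, the basic idea is a draft-and-verify loop: a small draft model proposes a few tokens, the full model checks them (in a real implementation, all positions in one batched forward pass), and the longest matching prefix is accepted. The sketch below is a generic greedy version with stand-in callables, not ExLlamaV2's actual implementation.

```python
from typing import Callable, List

def speculative_decode_greedy(
    target_next: Callable[[List[int]], int],   # greedy next token from the full model
    draft_next: Callable[[List[int]], int],    # greedy next token from the small draft model
    prompt: List[int],
    max_new_tokens: int = 32,
    k: int = 4,                                 # number of draft tokens per round
) -> List[int]:
    """Generic draft-and-verify loop (greedy). Stand-in callables, not ExLlamaV2 code."""
    seq = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1) Draft model speculates k tokens cheaply.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2) Target model verifies; with batching this is one forward pass over
        #    all k positions, which is where the speedup comes from.
        accepted = 0
        for i in range(k):
            t = target_next(seq + draft[:i])
            if t == draft[i]:
                accepted += 1
            else:
                draft[i] = t          # replace the first mismatch with the target's token
                accepted += 1
                break
        seq += draft[:accepted]
        produced += accepted
    return seq[len(prompt):][:max_new_tokens]

# Toy usage: both "models" follow the same fixed pattern, so every draft is accepted.
pattern = [1, 2, 3, 4, 5, 6, 7, 8]
oracle = lambda ids: pattern[len(ids) % len(pattern)]
print(speculative_decode_greedy(oracle, oracle, prompt=[0], max_new_tokens=8))
```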
Yep, LADE would need do_sample turned off. Seems speculative decoding is the go-to from the get-go.
Hi,
A recent technique called Lookahead Decoding could speed up inference further, by almost 100%:
https://github.com/hao-ai-lab/LookaheadDecoding
That repo shows what we mostly need to do to adapt a model to this technique (currently only LLaMA models and greedy-search generation are supported); the usage pattern is roughly the snippet below.
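```python
# Rough usage pattern from the LookaheadDecoding README (from memory; exact
# names and arguments may differ) - it patches HF LLaMA decoding in place:
import lade
lade.augment_all()
lade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7, DEBUG=0)
# ...then generate as usual with transformers, greedy search only (do_sample=False).
```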
It would be great if the same technique were implemented in exllamav2 as well!
Thanks!