turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Support LookaheadDecoding? #187

Closed: yhyu13 closed this issue 9 months ago

yhyu13 commented 9 months ago

Hi,

A recent technique called Lookahead Decoding could reportedly speed up inference further, by almost 100%:

https://github.com/hao-ai-lab/LookaheadDecoding

That repo shows that, for any model adapted to this technique, we mostly just need to do the following (currently only LLaMA models and greedy-search generation are supported):

import os
os.environ["USE_LADE"] = "1"   # must be set before importing lade
import lade
lade.augment_all()             # patch the generation loop to use lookahead decoding
lade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7, DEBUG=0)  # n-gram level, lookahead window and guess-set sizes

It would be great if the same technique were implemented in exllamav2 as well!

Thanks!

turboderp commented 9 months ago

Feel free to submit a PR. The most immediate obstacle for lookahead decoding is the lack of support in Flash Attention, which doesn't allow for custom attention masks.

Currently ExLlamaV2 supports speculative decoding, which gives a similar speedup to LADE or Medusa. It also allows sampling, and it takes advantage of flash-attn.
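
For reference, a rough sketch of what that setup looks like with the streaming generator. The model paths are placeholders, and the draft_model / draft_cache / num_speculative_tokens arguments follow the repo's speculative decoding example at the time, so check them against the current API:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator

def load(model_dir):
    # Standard ExLlamaV2 setup: config -> model -> cache -> tokenizer
    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()
    model = ExLlamaV2(config)
    model.load()
    return model, ExLlamaV2Cache(model), ExLlamaV2Tokenizer(config)

# Full model plus a much smaller draft model that shares its tokenizer (placeholder paths)
model, cache, tokenizer = load("/path/to/llama2-70b-exl2")
draft_model, draft_cache, _ = load("/path/to/tinyllama-1.1b-exl2")

# The draft model proposes a few tokens which the full model then verifies in one pass
generator = ExLlamaV2StreamingGenerator(
    model, cache, tokenizer,
    draft_model=draft_model,
    draft_cache=draft_cache,
    num_speculative_tokens=5
)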

yhyu13 commented 9 months ago

Yep, LADE would need do_sample turned off. Seems speculative decoding is the go-to from the get-go.
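
For anyone landing here later, a minimal sketch of sampled generation through the draft-assisted generator set up above; the prompt and sampler settings are arbitrary, and the call names should be checked against the current streaming API:

from exllamav2.generator import ExLlamaV2Sampler

# Sampling stays available with speculative decoding, unlike greedy-only LADE
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

input_ids = tokenizer.encode("Once upon a time,")
generator.begin_stream(input_ids, settings)

text = ""
for _ in range(200):                    # cap at 200 new tokens
    chunk, eos, _ = generator.stream()  # chunk is the newly decoded text
    text += chunk
    if eos:
        break
print(text)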