turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Feature request: EAGLE #244

Closed: vt404v2 closed this 3 months ago

vt404v2 commented 6 months ago

Hello! I found a method that speeds up generation by up to 3x: https://github.com/SafeAILab/EAGLE. Their project page: https://sites.google.com/view/eagle-llm

I think it could be used with exllama to speed it up significantly. It's kind of like an assistant model, except that a small model has to be trained for each model you run. I think training that small model could be automated in exllama. Could you please take a look at this and perhaps implement it, or at least add the ability to run your own trained small models alongside the main one, like in EAGLE?

turboderp commented 6 months ago

Well, this is another speculative method that still relies on generating a draft; they just have a different take on how to do that. What they're doing is essentially a smarter version of Medusa, but it's not a given that it would perform better than speculative decoding, which is already supported. You can use draft models of a comparable size (some interesting ones here, for instance) for decent results.
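For reference, this is the draft-and-verify loop that ordinary speculative decoding boils down to; EAGLE and Medusa mainly change where the draft tokens come from. A rough greedy sketch, not ExLlamaV2's actual API: `draft_logits` and `target_logits` here are hypothetical stand-ins for the two models, each mapping a 1-D tensor of token ids to `[seq_len, vocab]` logits.

```python
import torch

def speculative_step(ids, draft_logits, target_logits, k=4):
    # 1) Let the draft model propose k tokens autoregressively (greedy).
    draft_ids = ids.clone()
    for _ in range(k):
        next_tok = draft_logits(draft_ids)[-1].argmax()
        draft_ids = torch.cat([draft_ids, next_tok.view(1)])

    # 2) Score prompt + draft with the full model in a single forward pass.
    logits = target_logits(draft_ids)
    # The target's greedy choice at each draft position.
    target_choice = logits[len(ids) - 1 : -1].argmax(dim=-1)
    proposed = draft_ids[len(ids):]

    # 3) Accept the longest prefix where draft and target agree, then take one
    #    "free" token from the target at the first disagreement (or after the
    #    last accepted token if everything matched).
    agree = (proposed == target_choice).long()
    n_accept = int(agree.cumprod(dim=0).sum())
    accepted = proposed[:n_accept]
    bonus = target_choice[n_accept] if n_accept < k else logits[-1].argmax()
    return torch.cat([ids, accepted, bonus.view(1)])
```

Every token this loop emits is exactly what greedy decoding on the target alone would have produced; the draft only determines how many target forward passes you get to skip.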

One thing to keep in mind with all these techniques is that quantized inference is already a lot faster than the FP16 inference they always compare against in these tests, and the speedup from any kind of speculative method is going to be less significant because of it. If they're seeing a 4x improvement on FP16, that may only translate to a 2x improvement on 4-bit inference, for instance.
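To put some numbers on that (made up, purely illustrative, and assuming the draft overhead stays roughly constant while the target's own forward pass gets cheaper), the same acceptance rate buys a smaller multiple once the baseline is faster:

```python
# Toy cost model, illustrative numbers only, not measurements.

def speculative_speedup(t_target, t_draft, k=4, accepted=3.0):
    """Return the speedup factor of speculative decoding over plain decoding.

    t_target : time for one target forward pass (seconds)
    t_draft  : time for one draft forward pass (seconds)
    k        : draft tokens proposed per verification pass
    accepted : average number of draft tokens accepted per pass
    """
    baseline = 1.0 / t_target               # plain autoregressive tokens/sec
    # Assumes verifying k+1 tokens costs about one target pass (memory-bound).
    per_pass = k * t_draft + t_target
    speculative = (accepted + 1.0) / per_pass
    return speculative / baseline

# FP16 target: 30 ms per pass, draft at 2 ms -> large win
print(speculative_speedup(t_target=0.030, t_draft=0.002))   # ~3.2x
# 4-bit target: 12 ms per pass, same draft overhead -> smaller win
print(speculative_speedup(t_target=0.012, t_draft=0.002))   # ~2.4x
```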

Ultimately, the biggest obstacle is that ExLlamaV2 still leans heavily on Flash Attention, and FA doesn't support arbitrary attention masking, which means a tree of draft tokens just doesn't work. But if it did, SD could be sped up substantially as well.
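For anyone wondering what "arbitrary masking" means here: each token in a draft tree may only attend to its own ancestors, so siblings that appear earlier in the flattened sequence still have to be masked out, which a plain causal mask can't express. A small sketch with a hypothetical 5-node tree:

```python
# Build the attention mask a draft tree needs. Tree (hypothetical):
# root r has children a, b; a has children a1, a2.
import torch

nodes  = ["r", "a", "b", "a1", "a2"]
parent = {"r": None, "a": "r", "b": "r", "a1": "a", "a2": "a"}

def ancestors(n):
    chain = []
    while n is not None:
        chain.append(n)
        n = parent[n]
    return set(chain)

n = len(nodes)
mask = torch.zeros(n, n, dtype=torch.bool)
for i, q in enumerate(nodes):
    allowed = ancestors(q)           # a node attends only to itself + ancestors
    for j, kv in enumerate(nodes):
        mask[i, j] = kv in allowed

print(mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 0, 1, 0, 0],
#         [1, 1, 0, 1, 0],
#         [1, 1, 0, 0, 1]])
# Row "b" must not see "a", and row "a2" must not see "a1", even though those
# tokens come earlier in the flattened sequence. That holes-below-the-diagonal
# pattern is exactly what a plain causal mask can't represent.
```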