turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Feature request: EAGLE #244

Closed: vt404v2 closed this 3 months ago

vt404v2 commented 6 months ago

Hello! I found a method that speeds up generation by up to 3x: https://github.com/SafeAILab/EAGLE. Their project page: https://sites.google.com/view/eagle-llm

I think it could be used with exllama to speed it up significantly. It's kind of like an assistant model, except that a small model has to be trained for each model you run. I think training that small model could be automated in exllama. Could you please take a look at this and perhaps implement it, or at least add the ability to run your own trained small models alongside the main one, like in EAGLE?

turboderp commented 6 months ago

Well, this is another speculative method that still relies on generating a draft; they just have a different take on how to do that. What they're doing is essentially a smarter version of Medusa, but it's not a given that it would perform better than speculative decoding, which is already supported. You can use draft models of a comparable size (some interesting ones here, for instance) for decent results.
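For reference, this is the draft-and-verify loop that ordinary speculative decoding boils down to; EAGLE and Medusa mainly change where the draft tokens come from. A rough greedy sketch, not ExLlamaV2's actual API: `draft_logits` and `target_logits` here are hypothetical stand-ins for the two models, each mapping a 1-D tensor of token ids to `[seq_len, vocab]` logits.

```python
import torch

def speculative_step(ids, draft_logits, target_logits, k=4):
    # 1) Let the draft model propose k tokens autoregressively (greedy).
    draft_ids = ids.clone()
    for _ in range(k):
        next_tok = draft_logits(draft_ids)[-1].argmax()
        draft_ids = torch.cat([draft_ids, next_tok.view(1)])

    # 2) Score prompt + draft with the full model in a single forward pass.
    logits = target_logits(draft_ids)
    # The target's greedy choice at each draft position.
    target_choice = logits[len(ids) - 1 : -1].argmax(dim=-1)
    proposed = draft_ids[len(ids):]

    # 3) Accept the longest prefix where draft and target agree, then take one
    #    "free" token from the target at the first disagreement (or after the
    #    last accepted token if everything matched).
    agree = (proposed == target_choice).long()
    n_accept = int(agree.cumprod(dim=0).sum())
    accepted = proposed[:n_accept]
    bonus = target_choice[n_accept] if n_accept < k else logits[-1].argmax()
    return torch.cat([ids, accepted, bonus.view(1)])
```

Every token this loop emits is exactly what greedy decoding on the target alone would have produced; the draft only determines how many target forward passes you get to skip.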

One thing to keep in mind with all these techniques is that quantized inference is already a lot faster than the FP16 inference they always compare against in these tests, and the speedup from any kind of speculative method is going to be less significant because of it. If they're seeing a 4x improvement on FP16, that may only translate to a 2x improvement on 4-bit inference, for instance.
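To put some numbers on that (made up, purely illustrative, and assuming the draft overhead stays roughly constant while the target's own forward pass gets cheaper), the same acceptance rate buys a smaller multiple once the baseline is faster:

```python
# Toy cost model, illustrative numbers only, not measurements.

def speculative_speedup(t_target, t_draft, k=4, accepted=3.0):
    """Return the speedup factor of speculative decoding over plain decoding.

    t_target : time for one target forward pass (seconds)
    t_draft  : time for one draft forward pass (seconds)
    k        : draft tokens proposed per verification pass
    accepted : average number of draft tokens accepted per pass
    """
    baseline = 1.0 / t_target               # plain autoregressive tokens/sec
    # Assumes verifying k+1 tokens costs about one target pass (memory-bound).
    per_pass = k * t_draft + t_target
    speculative = (accepted + 1.0) / per_pass
    return speculative / baseline

# FP16 target: 30 ms per pass, draft at 2 ms -> large win
print(speculative_speedup(t_target=0.030, t_draft=0.002))   # ~3.2x
# 4-bit target: 12 ms per pass, same draft overhead -> smaller win
print(speculative_speedup(t_target=0.012, t_draft=0.002))   # ~2.4x
```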

Ultimately, the biggest obstacle is that ExLlamaV2 still leans heavily on Flash Attention, and FA doesn't support arbitrary attention masking, which means a tree of draft tokens just doesn't work. But if it did, SD could be sped up substantially as well.
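For anyone wondering what "arbitrary masking" means here: each token in a draft tree may only attend to its own ancestors, so siblings that appear earlier in the flattened sequence still have to be masked out, which a plain causal mask can't express. A small sketch with a hypothetical 5-node tree:

```python
# Build the attention mask a draft tree needs. Tree (hypothetical):
# root r has children a, b; a has children a1, a2.
import torch

nodes  = ["r", "a", "b", "a1", "a2"]
parent = {"r": None, "a": "r", "b": "r", "a1": "a", "a2": "a"}

def ancestors(n):
    chain = []
    while n is not None:
        chain.append(n)
        n = parent[n]
    return set(chain)

n = len(nodes)
mask = torch.zeros(n, n, dtype=torch.bool)
for i, q in enumerate(nodes):
    allowed = ancestors(q)           # a node attends only to itself + ancestors
    for j, kv in enumerate(nodes):
        mask[i, j] = kv in allowed

print(mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 0, 1, 0, 0],
#         [1, 1, 0, 1, 0],
#         [1, 1, 0, 0, 1]])
# Row "b" must not see "a", and row "a2" must not see "a1", even though those
# tokens come earlier in the flattened sequence. That holes-below-the-diagonal
# pattern is exactly what a plain causal mask can't represent.
```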