turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Performance with speculative decoding is slightly worse than without at full context #165

Status: Closed (cikkle closed 8 months ago)

cikkle commented 9 months ago

Running a 1.1B / 5.0 bpw draft model alongside a 70B / 4.625 bpw model on a dual 7900 XTX system (ROCm 5.7.1, amdgpu driver 6.2.4).

On a near-empty context I get a boost from around 12 t/s to about 18 t/s, but as the context fills up (to the standard 4096 tokens), performance degrades quickly, ending up at or slightly below where it would be with speculative decoding off. Based on others' reported experiences, I expected it to improve performance even at full context.

The screenshot below shows me repeatedly regenerating responses at the end of a lengthy chat with speculative decoding (SD) off. Then I turned SD on and continued regenerating messages:

[Screenshot: exllama_spec_eval]

turboderp commented 9 months ago

SD is always hit-and-miss. It requires a draft model that runs inference very quickly compared to the full model while guessing correctly often enough to make up for that overhead. It also depends on what's being generated and on the sampling settings, both of which influence how likely the model and draft model are to end up agreeing with each other.
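To make the trade-off above concrete, here is a rough back-of-the-envelope sketch in Python. It is not how exllamav2 measures or schedules anything. It assumes a simplified model in which each drafted token is accepted independently with the same probability, and in which one draft forward pass costs a fixed fraction of one full-model pass. The function name and all parameters are hypothetical, chosen only for illustration.

```python
def expected_speedup(accept_rate: float, num_draft: int, draft_cost: float) -> float:
    """Estimate the speedup of speculative decoding over plain decoding.

    Assumptions (illustrative only, not taken from exllamav2):
      accept_rate: probability that any single drafted token is accepted,
                   treated as independent and identical for every token.
      num_draft:   number of tokens the draft model proposes per step.
      draft_cost:  cost of one draft forward pass relative to one
                   full-model forward pass (e.g. 0.1 means 10x cheaper).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if num_draft < 1:
        raise ValueError("num_draft must be at least 1")

    # Expected tokens produced per verification step: the accepted prefix
    # of the draft plus one token from the full model (geometric series).
    if accept_rate == 1.0:
        tokens_per_step = num_draft + 1
    else:
        tokens_per_step = (1 - accept_rate ** (num_draft + 1)) / (1 - accept_rate)

    # Cost per step in units of full-model passes: num_draft draft passes
    # plus one full-model pass to verify them.
    cost_per_step = num_draft * draft_cost + 1.0

    return tokens_per_step / cost_per_step


if __name__ == "__main__":
    # Compare a draft that agrees often with one that rarely agrees.
    for rate in (0.8, 0.5, 0.2, 0.0):
        print(f"accept_rate={rate:.1f} -> speedup x{expected_speedup(rate, 4, 0.1):.2f}")
```

Under these assumptions, a value below 1.0 means SD is slower than plain decoding. That is the situation a draft model ends up in once its guesses stop matching the main model, since the draft overhead is still paid on every step.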

Of course, there's also a potential problem if the draft model doesn't scale correctly. It's supposed to calculate NTK alpha automatically to extend the draft model's context length to align with the main model, otherwise the draft just becomes useless after 2048 tokens (in the case of TinyLlama). I'll try to investigate if something's maybe broken there for Tabby.
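For readers unfamiliar with NTK alpha, the sketch below shows one common formulation of NTK-aware RoPE scaling, in which the rotary embedding base is multiplied by alpha raised to head_dim / (head_dim - 2). This is an illustration, not a copy of the loader's code. The function names are hypothetical, and the values used (a base of 10000 and a head dimension of 64) are assumptions about a TinyLlama-style draft model rather than figures from this thread.

```python
def ntk_scaled_base(base: float, alpha: float, head_dim: int) -> float:
    """Return the RoPE base after NTK-aware scaling by `alpha`.

    alpha = 1.0 leaves the base unchanged; larger values stretch the
    low-frequency rotary components so longer contexts stay in range.
    """
    if head_dim <= 2:
        raise ValueError("head_dim must be greater than 2")
    return base * alpha ** (head_dim / (head_dim - 2))


def rope_inv_freq(base: float, head_dim: int) -> list[float]:
    """Inverse frequencies for each rotary pair: base^(-2i / head_dim)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


if __name__ == "__main__":
    BASE = 10000.0   # assumed default RoPE base
    HEAD_DIM = 64    # assumed head dimension for the draft model
    ALPHA = 2.0      # the manual value tried later in this thread

    original = rope_inv_freq(BASE, HEAD_DIM)
    scaled = rope_inv_freq(ntk_scaled_base(BASE, ALPHA, HEAD_DIM), HEAD_DIM)

    # The fastest-rotating pair is untouched, while the slowest-rotating
    # pair is slowed down by a factor of alpha.
    print("highest frequency ratio:", scaled[0] / original[0])
    print("lowest frequency ratio: ", scaled[-1] / original[-1])
```

With this formulation the highest-frequency component is left alone and the lowest-frequency component is divided by alpha, which is why an alpha that is too small (or left at 1.0) would leave the draft model operating past the positions it was trained on.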

cikkle commented 9 months ago

Seems to be the automatic scaling. I just tried manually setting draft_rope_alpha: 2.0 in tabby's config.yml instead of relying on the default behavior and then repeated the test above, and I'm now regularly getting around 13 or 14 t/s.

turboderp commented 9 months ago

Hmm. @bdashore3 does the loader set draft_rope_alpha to anything by default?