Closed · cikkle closed this 8 months ago
SD is always hit-and-miss. It requires a draft model that runs inference very quickly compared to the full model while guessing correctly often enough to make up for the overhead. It also depends on what's being generated and on the sampling settings, both of which influence how likely the model and draft model are to agree with each other.
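That trade-off can be sketched with a rough back-of-the-envelope model (all names and numbers here are hypothetical, not Tabby's actual implementation): the draft proposes `k` tokens cheaply, the main model verifies them in one forward pass, and a geometric prefix of them is accepted.

```python
def sd_speedup(accept_rate, k, draft_cost, main_cost):
    """Estimate speculative decoding speedup over plain decoding.

    accept_rate: probability the main model accepts a drafted token
    k:           number of tokens drafted per verification pass
    draft_cost:  time for one draft-model forward pass
    main_cost:   time for one main-model forward pass
    """
    # Expected tokens per cycle: the accepted geometric prefix of the
    # k drafted tokens, plus the one token the main model produces
    # itself during verification.
    expected_tokens = (1 - accept_rate ** (k + 1)) / (1 - accept_rate)
    cycle_cost = k * draft_cost + main_cost
    baseline_cost = expected_tokens * main_cost  # cost without SD
    return baseline_cost / cycle_cost

# High acceptance: a real win. Low acceptance: slower than no SD at all.
print(sd_speedup(0.8, 4, 1.0, 20.0))
print(sd_speedup(0.1, 4, 1.0, 20.0))
```

This is why a draft that stops agreeing with the main model (for instance, because its context scaling is broken) doesn't just stop helping, it actively costs throughput.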
Of course, there's also a potential problem if the draft model doesn't scale correctly. Tabby is supposed to calculate the NTK alpha automatically to extend the draft model's context length to match the main model; otherwise the draft just becomes useless past 2048 tokens (in the case of TinyLlama). I'll try to investigate whether something's broken there for Tabby.
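For reference, a common NTK-aware heuristic for picking that alpha looks roughly like this (a sketch of the general technique, not Tabby's exact code; the `head_dim` default is an assumption matching TinyLlama's 2048 hidden size / 32 heads):

```python
def ntk_alpha(target_ctx, native_ctx, head_dim=64):
    """NTK-aware RoPE alpha heuristic: scale the rotary base enough
    that the draft model's positions cover the main model's context."""
    ratio = target_ctx / native_ctx
    if ratio <= 1.0:
        return 1.0  # no extension needed
    # Exponent d/(d-2) compensates for the lowest-frequency rotary dims.
    return ratio ** (head_dim / (head_dim - 2))

# Extending TinyLlama's native 2048 to a 4096 main context
# lands very close to alpha = 2.0.
print(ntk_alpha(4096, 2048))
```

If that computation is skipped (or fed the wrong native length), the draft's predictions past 2048 tokens would diverge from the main model and acceptance would collapse.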
Seems to be the automatic scaling. I just tried manually setting `draft_rope_alpha: 2.0` in tabby's config.yml instead of relying on the default behavior, then repeated the test above, and I'm now regularly getting around 13 or 14 t/s.
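For anyone trying the same workaround, the override goes in the draft model section of config.yml; the surrounding keys here are illustrative and may not match your config exactly:

```yaml
draft:
  # Draft model directory name (hypothetical example)
  draft_model_name: TinyLlama-1.1B-exl2
  # Manual override instead of the automatic NTK alpha calculation
  draft_rope_alpha: 2.0
```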
Hmm. @bdashore3 does the loader set `draft_rope_alpha` to anything by default?
Running a 1.1B / 5.0 bpw draft model alongside a 70B / 4.625 bpw main model on a dual 7900 XTX system (ROCm 5.7.1, amdgpu driver 6.2.4).
On a near-empty context I get a slight boost from around 12 t/s to ~18 t/s, but as the context fills up (to the standard 4096), performance degrades quickly, ending up at or slightly worse than where I'd be with SD off. Based on others' reported experiences I expected it to improve performance even at full context.

The screenshot below is me repeatedly regenerating responses at the end of a lengthy chat with SD off; then I turned SD on and continued regenerating messages: