Qwen2.5-Coder-32B + Qwen2.5-Coder-1.5B as draft model
Describe the bug
When loading the Qwen2.5-Coder-32B model through Exui, I get around 20 tokens/s with a 4.25 bpw quant (unrelated, but this is also lower than the 25+ tokens/s I see with llama.cpp). However, when the 1.5B version of the model is loaded as a draft model, performance drops below 16 tokens/s instead of speeding up. With llama.cpp, speculative decoding does give me a speedup (a little over 2x).
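For what it's worth, the same comparison can be run against the ExLlamaV2 dynamic generator directly, which might help tell whether the regression comes from Exui or from the backend. The snippet below is only a sketch based on ExLlamaV2's speculative decoding example; the model paths, cache length and num_draft_tokens value are placeholder assumptions, not my exact settings.

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

def load(model_dir, max_seq_len):
    # Load a model with a lazily allocated cache, split across available GPUs
    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, max_seq_len = max_seq_len, lazy = True)
    model.load_autosplit(cache)
    return config, model, cache

# Hypothetical local paths to the EXL2 quants
config, model, cache = load("/models/Qwen2.5-Coder-32B-exl2-4_25bpw", 8192)
_, draft_model, draft_cache = load("/models/Qwen2.5-Coder-1.5B-exl2", 8192)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    draft_model = draft_model,   # omit these two arguments to benchmark
    draft_cache = draft_cache,   # the 32B model on its own
    tokenizer = tokenizer,
    num_draft_tokens = 4,        # tokens speculated per step (assumed value)
)

output = generator.generate(prompt = "Write a short story about a robot.",
                            max_new_tokens = 512, add_bos = True)
print(output)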
Reproduction steps
1. Load the 32B model through Exui
2. Ask for a story in a chat
3. See around 20 tokens/second
4. Unload the 32B model
5. Load the 32B model + 1.5B draft model
6. Ask for another story in a new chat (a standalone timing sketch follows these steps)
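To put rough numbers on steps 3 and 6 without going through the UI, a helper like the one below could be used together with the generator sketch above, building one generator with draft_model/draft_cache and one without. The measurement is deliberately crude: it assumes the story generation runs close to max_new_tokens, so the reported rate is only an estimate for comparing the two configurations.

import time

def measure_tps(generator, prompt, max_new_tokens = 512):
    # Rough tokens-per-second estimate: assumes generation runs to
    # max_new_tokens (long story prompts usually do), with prompt
    # processing counted as a small constant overhead.
    t0 = time.time()
    generator.generate(prompt = prompt, max_new_tokens = max_new_tokens,
                       add_bos = True)
    return max_new_tokens / (time.time() - t0)

# Example comparison (names are placeholders for the two generators):
# print("with draft:   ", measure_tps(spec_generator, "Tell me a story."))
# print("without draft:", measure_tps(base_generator, "Tell me a story."))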
Expected behavior
Generation speeds up relative to the ~20 tokens/s baseline.
Actual outcome: Performance regresses to below 16 tokens/s.
Logs
No response
Additional context
No response
Acknowledgements
[X] I have looked for similar issues before submitting this one.
[X] I understand that the developers have lives and my issue will be answered when possible.
[X] I understand the developers of this program are human, and I will ask my questions politely.
OS
Linux
GPU Library
AMD ROCm
Python version
3.12
Pytorch version
Pulled from https://download.pytorch.org/whl/rocm6.2 yesterday
Model
Qwen2.5-Coder-32B + Qwen2.5-Coder-1.5B as draft model