turboderp / exllamav2


[BUG] Speculative decoding regresses performance on 7900 XTX under ROCm #685

Open Mushoz opened 6 days ago

Mushoz commented 6 days ago

OS

Linux

GPU Library

AMD ROCm

Python version

3.12

Pytorch version

Pulled from https://download.pytorch.org/whl/rocm6.2 yesterday

Model

Qwen2.5-Coder-32B + Qwen2.5-Coder-1.5B as draft model

Describe the bug

When loading the Qwen2.5-Coder-32B model through exui, I get around 20 tokens/s with the 4.25 bpw quant (unrelated, but this is also lower than the 25+ tokens/s I see with llama.cpp). However, when the 1.5B version of the model is loaded alongside it as a draft model, performance drops below 16 tokens/s instead of speeding up. With llama.cpp, speculative decoding does give me a speedup (a little over 2x).

Reproduction steps

  1. Load the 32B model through exui
  2. Ask for a story in a chat
  3. See around 20 tokens/second
  4. Unload the 32B model
  5. Load the 32B model together with the 1.5B draft model (see the script sketch after this list)
  6. Ask for another story in a new chat
  7. See generation drop below 16 tokens/second
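To separate exui from the library itself, the same combination can also be loaded directly through exllamav2's dynamic generator. This is only a sketch: the model paths are placeholders, and the `draft_model` / `draft_cache` / `num_draft_tokens` arguments reflect my reading of the speculative decoding example in the repo, not a verified reproduction script.

```python
# Sketch: load Qwen2.5-Coder-32B with the 1.5B model as a speculative draft,
# outside exui, to check whether the regression is exui-specific.
# Paths below are placeholders.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

def load(model_dir, max_seq_len):
    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, max_seq_len = max_seq_len, lazy = True)
    model.load_autosplit(cache)
    return config, model, cache

config, model, cache = load("/models/Qwen2.5-Coder-32B-exl2-4.25bpw", 16384)
_, draft_model, draft_cache = load("/models/Qwen2.5-Coder-1.5B-exl2", 16384)
tokenizer = ExLlamaV2Tokenizer(config)   # both models share the Qwen2.5 vocabulary

generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
    draft_model = draft_model,     # comment out these three lines to measure
    draft_cache = draft_cache,     # the non-speculative baseline
    num_draft_tokens = 4,
)

output = generator.generate(prompt = "Write a short story.", max_new_tokens = 500)
print(output)
```

Timing the same prompt with and without the three draft arguments should show the same regression if it is not exui-specific.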

Expected behavior

A speedup from speculative decoding is obtained.

Actual outcome: performance regresses instead.

Logs

No response

Additional context

No response

Originalimoc commented 4 days ago

Set a lower max seq len; 16k is recommended.
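(If loading through the library rather than exui, the equivalent would be the `max_seq_len` passed to the caches; a sketch assuming the same cache API as above:)

```python
# Assumption: the context limit is set per cache; 16k for both model and draft
cache = ExLlamaV2Cache(model, max_seq_len = 16384, lazy = True)
draft_cache = ExLlamaV2Cache(draft_model, max_seq_len = 16384, lazy = True)
```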