turboderp / exllamav2


[BUG] Speculative decoding regresses performance on 7900 XTX under ROCm #685

Open Mushoz opened 6 days ago

Mushoz commented 6 days ago

OS

Linux

GPU Library

AMD ROCm

Python version

3.12

Pytorch version

Pulled from https://download.pytorch.org/whl/rocm6.2 yesterday

Model

Qwen2.5-Coder-32B + Qwen2.5-Coder-1.5B as draft model

Describe the bug

When loading the Qwen2.5-Coder-32B model through exui, I get around 20 tokens/s with the 4.25 bpw quant (unrelated, but this is also lower than the 25+ tokens/s I see with llama.cpp). However, when the 1.5B version of the model is loaded alongside it as a draft model, performance drops below 16 tokens/s instead of speeding up. With llama.cpp, speculative decoding does give me a speedup (a little over 2x).

Reproduction steps

  1. Load the 32B model through exui
  2. Ask for a story in a chat
  3. See around 20 tokens/second
  4. Unload the 32B model
  5. Load the 32B model together with the 1.5B draft model (see the script sketch after this list)
  6. Ask for another story in a new chat
  7. See generation drop below 16 tokens/second
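To separate exui from the library itself, the same combination can also be loaded directly through exllamav2's dynamic generator. This is only a sketch: the model paths are placeholders, and the `draft_model` / `draft_cache` / `num_draft_tokens` arguments reflect my reading of the speculative decoding example in the repo, not a verified reproduction script.

```python
# Sketch: load Qwen2.5-Coder-32B with the 1.5B model as a speculative draft,
# outside exui, to check whether the regression is exui-specific.
# Paths below are placeholders.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

def load(model_dir, max_seq_len):
    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, max_seq_len = max_seq_len, lazy = True)
    model.load_autosplit(cache)
    return config, model, cache

config, model, cache = load("/models/Qwen2.5-Coder-32B-exl2-4.25bpw", 16384)
_, draft_model, draft_cache = load("/models/Qwen2.5-Coder-1.5B-exl2", 16384)
tokenizer = ExLlamaV2Tokenizer(config)   # both models share the Qwen2.5 vocabulary

generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
    draft_model = draft_model,     # comment out these three lines to measure
    draft_cache = draft_cache,     # the non-speculative baseline
    num_draft_tokens = 4,
)

output = generator.generate(prompt = "Write a short story.", max_new_tokens = 500)
print(output)
```

Timing the same prompt with and without the three draft arguments should show the same regression if it is not exui-specific.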

Expected behavior

A speedup from speculative decoding is obtained.

Actual outcome: performance regresses instead.

Logs

No response

Additional context

No response

Originalimoc commented 4 days ago

Set a lower max seq len; 16k is recommended.
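(If loading through the library rather than exui, the equivalent would be the `max_seq_len` passed to the caches; a sketch assuming the same cache API as above:)

```python
# Assumption: the context limit is set per cache; 16k for both model and draft
cache = ExLlamaV2Cache(model, max_seq_len = 16384, lazy = True)
draft_cache = ExLlamaV2Cache(draft_model, max_seq_len = 16384, lazy = True)
```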