Describe the bug
Qwen 2.5 72B Instruct with a draft model, whether Qwen 2.5 0.5B or 1.5B, produces garbage. Sometimes it takes a few requests to lose context, but with a long-context conversation (~15,900 tokens) it always goes off the rails, becoming inconsistent, with repetitions and garbage. The output degrades from "okay, that's expected", through "a lot of typos" and "wow, such Chinese!", to infinite repetition of loosely related trash.
The same model without a draft produces consistent, good output (at a slower tps 😄).
Reproduction steps
Using TabbyAPI
config.yml
tensor_parallel: true # I have 3090 + 4090
gpu_split_auto: true
gpu_split: [21.0, 24.0] # loads the draft model and half of the main model onto the 4090; because of OS overhead it won't fit in 24 GB otherwise
cache_mode: Q6 # Q4 makes no difference
chunk_size: 2048
fasttensors: true # tried false as well
draft_cache_mode: Q6 # Q4 works the same
cuda_malloc_backend: true
uvloop: true
Load model
Qwen2.5-72B-Instruct-exl2 - 4.0bpw, quantized with exllama 2.4.3
Qwen2.5-0.5b-instruct-exl2 - 4.0bpw, quantized with exllama 2.4.3
The quants were created from the original models, downloaded today (at the same time) from the official Qwen repository.
Qwen2.5-72B-Instruct-exl2 without the draft model works fine.
Generate chat completions
I'm using Open Web UI, but I don't think it matters much.
All TabbyAPI generation settings are at their defaults.
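To rule out the client, the same behaviour should be reproducible by calling TabbyAPI's OpenAI-compatible chat completions endpoint directly. A minimal sketch, assuming the default port 5000; the API key, auth header style, and prompt contents are placeholders:

```python
# Minimal repro sketch against TabbyAPI's OpenAI-compatible API (no Open Web UI).
# Base URL, API key, and prompt contents below are placeholders.
import requests

BASE_URL = "http://127.0.0.1:5000/v1"  # assuming TabbyAPI's default port
API_KEY = "<tabby-api-key>"

# Any multi-turn conversation around ~16k tokens triggers it; a single long
# user message stands in for the full chat history here.
messages = [
    {"role": "user", "content": "<paste a ~15k-token conversation or document here>"},
]

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "Qwen2.5-72B-Instruct-exl2",
        "messages": messages,
        "max_tokens": 512,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

With the draft model loaded, the completion degrades as described above; with the draft model removed from the config, the same request produces normal output.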
Expected behavior
The draft model doesn't affect the output quality of the base model.
Logs
No response
Additional context
I've noticed in the models' configs that Qwen 72B has a slightly bigger vocab_size than Qwen 0.5B. It looks like the Qwen models from 0.5B to 14B have "vocab_size": 151936, while 32B and 72B have "vocab_size": 152064. I don't know whether this can affect generation.
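For reference, the mismatch is easy to check by comparing the vocab_size fields in the two models' config.json files. A minimal sketch; the local directory paths are placeholders:

```python
# Compare vocab_size between the main model and the draft model.
# The directory paths are placeholders for the local exl2 quant folders.
import json
from pathlib import Path

main_dir = Path("models/Qwen2.5-72B-Instruct-exl2")
draft_dir = Path("models/Qwen2.5-0.5b-instruct-exl2")

main_vocab = json.loads((main_dir / "config.json").read_text())["vocab_size"]
draft_vocab = json.loads((draft_dir / "config.json").read_text())["vocab_size"]

print("main: ", main_vocab)   # 152064 for 32B/72B
print("draft:", draft_vocab)  # 151936 for 0.5B-14B
print("delta:", main_vocab - draft_vocab)  # 128 token IDs present only in the main model
```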
Acknowledgements
[X] I have looked for similar issues before submitting this one.
[X] I understand that the developers have lives and my issue will be answered when possible.
[X] I understand the developers of this program are human, and I will ask my questions politely.
OS
Windows
GPU Library
CUDA 12.x
Python version
3.12
Pytorch version
2.4.1+cu121
Model
Qwen/Qwen2.5-72B-Instruct