Open jklj077 opened 4 days ago
cc @robertgshaw2-neuralmagic
I encountered the same issue, but only with the /chat/completions endpoint, which returns output consisting of many `!!!!!`, while the /completions endpoint works fine.
vLLM version: 0.6.1
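For anyone trying to confirm the same split, a rough check against both OpenAI-compatible endpoints could look like the sketch below; the base URL, model name, and prompts are placeholders for whatever your server actually exposes:

```python
# Sketch: compare /v1/completions and /v1/chat/completions on the same server.
# Assumes a vLLM OpenAI-compatible server is already running at BASE_URL with MODEL loaded.
import requests

BASE_URL = "http://localhost:8000/v1"           # placeholder
MODEL = "Qwen2.5-32B-Instruct-GPTQ-Int4"        # placeholder model name

# Plain completions endpoint (reported to work)
r1 = requests.post(
    f"{BASE_URL}/completions",
    json={"model": MODEL, "prompt": "Hello, my name is", "max_tokens": 32},
)
print("/completions:", r1.json()["choices"][0]["text"])

# Chat completions endpoint (reported to emit '!!!!!')
r2 = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
    },
)
print("/chat/completions:", r2.json()["choices"][0]["message"]["content"])
```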
Also cc @mgoin
As far as I can tell, the gptq kernel hasn't been touched all year; the last change was https://github.com/vllm-project/vllm/pull/2330 by @chu-tianxiang.
This may be a fundamental issue with the kernel for this model; someone would need to dive in and learn about it.
Your current environment
The output of `python collect_env.py`
N/A; the issue was reported by multiple users.
Model Input Dumps
No response
🐛 Describe the bug
We have been receiving reports that the 4-bit GPTQ version of Qwen2.5-32B-Instruct cannot be used with `vllm`: the generation only contains `!!!!!`. However, it was also reported that the same model worked using `transformers` and `auto_gptq`.
Here are some related issues:
We attempted to reproduce the issue, which appears to be related to the quantization kernels. Here is a summary:

- `gptq_marlin` works
- `gptq` fails for requests with `len(prompt_token_ids) <= 50` but works for longer input sequences

The results are consistent across:

- `tensor-parallel-size`: 2, 4, 8
- `vllm` versions: v0.6.1.post2, v0.6.2, v0.6.3.post1, v0.6.4.post1

As `gptq_marlin` is not available for Turing and Volta cards, we have not found a workaround for those users. It would help a lot if someone could investigate the issue.
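For reference, a minimal offline reproduction along the lines summarized above might look like the sketch below; the model path and prompts are placeholders, and `quantization="gptq"` is passed explicitly to force the plain GPTQ kernel rather than `gptq_marlin`:

```python
# Sketch of an offline reproduction, assuming a local or Hub copy of the
# GPTQ-Int4 checkpoint. quantization="gptq" forces the plain GPTQ kernel
# (the default would pick gptq_marlin on supported GPUs, which works).
from vllm import LLM, SamplingParams

MODEL = "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4"  # placeholder path/name

llm = LLM(model=MODEL, quantization="gptq", tensor_parallel_size=2)
params = SamplingParams(temperature=0.0, max_tokens=32)

short_prompt = "Hello"          # only a few tokens: reported to produce '!!!!!'
long_prompt = "word " * 100     # well over 50 tokens: reported to work

for out in llm.generate([short_prompt, long_prompt], params):
    print(repr(out.outputs[0].text))
```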