vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Qwen2.5-32B-GPTQ-Int4 inference `!!!!!` #10656

Open jklj077 opened 4 days ago

jklj077 commented 4 days ago

Your current environment

The output of `python collect_env.py`: N/A; the issue has been reported by multiple users in different environments.

Model Input Dumps

No response

🐛 Describe the bug

We have been receiving reports that the 4-bit GPTQ version of Qwen2.5-32B-Instruct cannot be used with vLLM: the generated output consists only of `!!!!!`. However, it was also reported that the same model works with transformers and auto_gptq.
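For anyone trying to reproduce this, a minimal sketch with the vLLM Python API would look roughly like the following; the prompt and sampling settings are illustrative assumptions, not taken from the original reports:

```python
# Minimal reproduction sketch (prompt and sampling settings are assumptions).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4", quantization="gptq")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Give me a short introduction to large language models."], params)
# On affected setups the generated text reportedly consists only of "!!!!!".
print(outputs[0].outputs[0].text)
```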

Here are some related issues:

We attempted to reproduce the issue, which appears related to quantization kernels, and the following is a summary:

The results are consistent for

As gptq_marlin is not available on Turing and Volta cards, we have not been able to find a workaround for those users. It would help a lot if someone could investigate the issue.
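On Ampere or newer GPUs, forcing the Marlin-based kernel is one thing affected users could try; this is an assumption based on the kernel behavior described above, not a confirmed fix, and it does not help on Turing/Volta:

```python
# Assumed workaround sketch: force the Marlin-based GPTQ kernel.
# Only applies to Ampere or newer GPUs; gptq_marlin is not available on
# Turing/Volta, which is exactly why those users have no workaround.
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4", quantization="gptq_marlin")
```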


youkaichao commented 4 days ago

cc @robertgshaw2-neuralmagic

youqugit commented 3 days ago

I encountered the same issue: only the /chat/completions endpoint returns output that is mostly `!!!!!`, while the /completions endpoint works fine (see the sketch below).

vLLM version: 0.6.1
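To make the comparison concrete, the two endpoints would be exercised roughly as follows against a local vLLM OpenAI-compatible server; the base URL, prompt, and messages are illustrative assumptions:

```python
# Sketch of the two requests being compared; base_url, prompt, and messages
# are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# /v1/completions -- reportedly works fine
completion = client.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",
    prompt="Hello, my name is",
    max_tokens=32,
)
print(completion.choices[0].text)

# /v1/chat/completions -- reportedly returns mostly "!!!!!"
chat = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=32,
)
print(chat.choices[0].message.content)
```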

DarkLight1337 commented 3 days ago

Also cc @mgoin

mgoin commented 3 days ago

As far as I can tell, the GPTQ kernel hasn't been touched all year; the last change was https://github.com/vllm-project/vllm/pull/2330 by @chu-tianxiang.

This may be a fundamental issue with the kernel for this model; someone would need to dive in and learn about it.