vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: awq marlin error for deepseek v2 lite #9913

Open TechxGenus opened 2 weeks ago

TechxGenus commented 2 weeks ago

Your current environment

vllm==0.6.3.post1

Model Input Dumps

ValueError: Weight input_size_per_partition = 10944 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.

🐛 Describe the bug

When I run the following command on a single GPU:

python -m vllm.entrypoints.openai.api_server --port 10086 --model TechxGenus/DeepSeek-Coder-V2-Lite-Instruct-AWQ --dtype float16 --gpu-memory-utilization 0.8 --max-model-len 8192 --enable-prefix-caching --disable-log-requests --trust-remote-code --enforce-eager

It raises this error:

ValueError: Weight input_size_per_partition = 10944 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.
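For reference, the divisibility the check complains about can be verified directly. A minimal standalone sketch (the constant name mirrors GPTQ_MARLIN_MIN_THREAD_K in marlin_utils.py, and 10944 is simply the value reported in the error message):

# Sketch of the shape check behind the error above.
input_size_per_partition = 10944  # value from the error message
for min_thread_k in (128, 64):
    remainder = input_size_per_partition % min_thread_k
    status = "fails" if remainder else "passes"
    print(f"min_thread_k={min_thread_k}: remainder={remainder} ({status} the check)")

# Output:
# min_thread_k=128: remainder=64 (fails the check)
# min_thread_k=64: remainder=0 (passes the check)

So 10944 is not a multiple of 128, but it is a multiple of 64, which is why relaxing the constant below lets the model load.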

Then I tried changing the Marlin kernel configuration at https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/utils/marlin_utils.py#L14:

-GPTQ_MARLIN_MIN_THREAD_K = 128
+GPTQ_MARLIN_MIN_THREAD_K = 64

It runs successfully, and the output also looks good. Will this change have any other impact on the Marlin kernel? If not, I hope it can be applied upstream to support this model.


tohnee commented 2 days ago

Thanks, it helps a lot 🌹