vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: awq marlin error for deepseek v2 lite #9913

Open TechxGenus opened 2 weeks ago

TechxGenus commented 2 weeks ago

Your current environment

vllm==0.6.3.post1

Model Input Dumps

ValueError: Weight input_size_per_partition = 10944 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.

🐛 Describe the bug

When I run the following command on a single GPU:

python -m vllm.entrypoints.openai.api_server --port 10086 --model TechxGenus/DeepSeek-Coder-V2-Lite-Instruct-AWQ --dtype float16 --gpu-memory-utilization 0.8 --max-model-len 8192 --enable-prefix-caching --disable-log-requests --trust-remote-code --enforce-eager

It raises this error:

ValueError: Weight input_size_per_partition = 10944 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.
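For reference, the divisibility the check complains about can be verified directly. A minimal standalone sketch (the constant name mirrors GPTQ_MARLIN_MIN_THREAD_K in marlin_utils.py, and 10944 is simply the value reported in the error message):

# Sketch of the shape check behind the error above.
input_size_per_partition = 10944  # value from the error message
for min_thread_k in (128, 64):
    remainder = input_size_per_partition % min_thread_k
    status = "fails" if remainder else "passes"
    print(f"min_thread_k={min_thread_k}: remainder={remainder} ({status} the check)")

# Output:
# min_thread_k=128: remainder=64 (fails the check)
# min_thread_k=64: remainder=0 (passes the check)

So 10944 is not a multiple of 128, but it is a multiple of 64, which is why relaxing the constant below lets the model load.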

Then I tried changing the Marlin kernel configuration at https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/utils/marlin_utils.py#L14:

-GPTQ_MARLIN_MIN_THREAD_K = 128
+GPTQ_MARLIN_MIN_THREAD_K = 64

It runs successfully, and the output also looks good. Will this change have any other impact on the Marlin kernel? If not, I hope it can be applied upstream to support this model.


tohnee commented 2 days ago

Thanks, it helps a lot 🌹