Closed: hayleyhu closed this issue 1 month ago.
Related issue: https://github.com/vllm-project/llm-compressor/issues/54
I have tried changing every `thread_k = 128` to 32, but the server still fails to start.
The Marlin kernel has some limitations on the weight shapes it can support, and unfortunately the Qwen matrix dimensions are not a power of two. For this model, `--tensor-parallel-size 2` is the maximum TP size that can run with Marlin.
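For context, the constraint reduces to a divisibility check on each GPU's weight shard. The following is a minimal sketch: `GPTQ_MARLIN_MIN_THREAD_K` is quoted in this issue, while the `GPTQ_MARLIN_MIN_THREAD_N` value and the model dimensions are assumptions, chosen so that TP 2 tiles evenly while TP 4 does not.

```python
# Minimal sketch of the shard-shape check behind Marlin's limitation.
# GPTQ_MARLIN_MIN_THREAD_K appears in vLLM's gptq_marlin.py (quoted in this
# issue); the MIN_THREAD_N value and the model dimensions are assumptions.
GPTQ_MARLIN_MIN_THREAD_N = 64
GPTQ_MARLIN_MIN_THREAD_K = 128

def marlin_shard_ok(n_per_gpu: int, k_per_gpu: int) -> bool:
    """A shard works only if both dimensions tile evenly into thread tiles."""
    return (n_per_gpu % GPTQ_MARLIN_MIN_THREAD_N == 0
            and k_per_gpu % GPTQ_MARLIN_MIN_THREAD_K == 0)

# A row-parallel projection shards its input (k) dimension across GPUs, so a
# dimension that is not a high enough power-of-two multiple stops dividing
# evenly as the tensor-parallel size grows.
hidden_size = 8192          # hypothetical output (n) dimension
intermediate_size = 29440   # hypothetical: 128 * 230, so it splits evenly
                            # across 2 GPUs but not across 4 or 8
for tp in (1, 2, 4, 8):
    k_shard = intermediate_size // tp
    print(f"tp={tp}: k shard {k_shard} -> ok={marlin_shard_ok(hidden_size, k_shard)}")
```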
Closing the issue. For kernel support requests, please open an issue in vllm-project/vllm.
**Describe the bug**
Cannot serve the Qwen2-72B W8A16 compressed model with the vLLM server.

**Expected behavior**
The model can be served with the vLLM server.
**Environment**
I used https://huggingface.co/neuralmagic/Qwen2-72B-Instruct-quantized.w8a16

**To Reproduce**
Clone the vLLM repo.
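The remaining reproduction steps were truncated in the report. As a hedged sketch of the working configuration described above, loading the same checkpoint through vLLM's Python API at the maximum supported tensor-parallel size would look like this (the model ID comes from the link above; the prompt is illustrative):

```python
# Minimal sketch using vLLM's offline Python API; the OpenAI-compatible
# server takes the same model and tensor-parallel settings as CLI flags.
from vllm import LLM

llm = LLM(
    model="neuralmagic/Qwen2-72B-Instruct-quantized.w8a16",
    tensor_parallel_size=2,  # per the maintainer reply, the max TP Marlin supports here
)
outputs = llm.generate(["Hello, my name is"])  # illustrative prompt
print(outputs[0].outputs[0].text)
```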
**Errors**
After I change `GPTQ_MARLIN_MIN_THREAD_K` to 32 in `gptq_marlin.py`, I got a different error.