Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Your current environment
vllm==0.6.3.post1
Model Input Dumps
🐛 Describe the bug
When I run the command on one GPU:
It raises the error:
ValueError: Weight input_size_per_partition = 10944 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.
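For context, the failure is a simple divisibility constraint: the per-GPU shard of the weight's input dimension must be a multiple of the kernel's minimum thread_k tile. The sketch below is illustrative only (the function name and signature are hypothetical, not vLLM's actual validation code), but it shows why 10944 fails with min_thread_k = 128 and passes with 64.

```python
# Minimal sketch of the shape check behind the error above (not vLLM's actual
# validation code; the function name and arguments are hypothetical).

def shard_is_marlin_compatible(input_size: int, tp_size: int, min_thread_k: int) -> bool:
    """Return True if the per-GPU input shard divides evenly into thread_k tiles."""
    input_size_per_partition = input_size // tp_size
    return input_size_per_partition % min_thread_k == 0

# 10944 = 85 * 128 + 64, so it is not a multiple of 128 and the check fails ...
print(shard_is_marlin_compatible(10944, tp_size=1, min_thread_k=128))  # False
# ... but 10944 = 171 * 64, so relaxing min_thread_k to 64 makes it pass.
print(shard_is_marlin_compatible(10944, tp_size=1, min_thread_k=64))   # True
```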
Then I tried to change the configuration of the Marlin kernel at https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/utils/marlin_utils.py#L14:
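The exact edit is not included above. As a rough reconstruction, the Marlin tiling constants are defined near the top of marlin_utils.py, and lowering the minimum thread_k from 128 to 64 is the kind of change that lets a 10944-wide input dimension pass the check; treat the constant names and values below as assumptions about that file rather than a verified diff.

```python
# Sketch of the constants near the top of marlin_utils.py (names/values are an
# assumption based on vLLM 0.6.x, not a verified copy of the file).
GPTQ_MARLIN_TILE = 16
GPTQ_MARLIN_MIN_THREAD_N = 64
GPTQ_MARLIN_MIN_THREAD_K = 128  # default: 10944 % 128 == 64, so the check fails

# Hypothetical change of the kind described in this report: halve the minimum
# thread_k so that 10944 (= 171 * 64) becomes a valid per-partition input size.
GPTQ_MARLIN_MIN_THREAD_K = 64
```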
It ran successfully, and the output also looks good. Will this have any other impact on the Marlin kernel? If there is no other impact, I hope this change can be applied to support it.