vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Doc]: Marlin does not support weight_bits = uint4b8 #8149

Closed · xiaotukuaipao12318 closed this issue 1 week ago

xiaotukuaipao12318 commented 1 week ago

📚 The doc issue

Loading the downloaded Qwen2-7B-Instruct-GPTQ-Int4 model with

```
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 17003 --model /data/Qwen2-7B-Instruct-GPTQ-Int4 --served-model-name Qwen
```

fails with a Marlin quantization error:

```
  self.model = get_model(model_config=self.model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 324, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 152, in _initialize_model
    quant_config = _get_quantization_config(model_config, load_config)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 93, in _get_quantization_config
    quant_config = get_quant_config(model_config, load_config)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/weight_utils.py", line 132, in get_quant_config
    return quant_cls.from_config(hf_quant_config)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 84, in from_config
    return cls(weight_bits, group_size, desc_act, is_sym,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 51, in __init__
    verify_marlin_supported(quant_type=self.quant_type,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 88, in verify_marlin_supported
    raise ValueError(err_msg)
ValueError: Marlin does not support weight_bits = uint4b8. Only types = [] are supported (for group_size = 128, min_capability = 75, zp = False).
```
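A possible temporary workaround, assuming the checkpoint is a standard GPTQ model and the problem is only with the auto-selected Marlin kernel path, is to force vLLM's plain GPTQ kernel with the existing `--quantization gptq` flag (a sketch only; whether this kernel runs still depends on your GPU):

```
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 17003 \
    --model /data/Qwen2-7B-Instruct-GPTQ-Int4 \
    --served-model-name Qwen \
    --quantization gptq
```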

Suggest a potential alternative/fix

No response


youkaichao commented 1 week ago

cc @mgoin

mgoin commented 1 week ago

Based on min_capability = 75, is it the case that you are using a T4 GPU? This should have failed before reaching this point, since Marlin only supports compute capability >= 80.
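For reference, a minimal PyTorch snippet to check your GPU's compute capability (a T4 reports (7, 5), which matches the min_capability = 75 in the error):

```python
import torch

# Marlin kernels require compute capability >= 8.0 (Ampere or newer).
# A Turing-generation T4 reports (7, 5), i.e. capability 75.
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
```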

mgoin commented 1 week ago

Thanks for finding this issue; I have resolved it in the PR linked above ^