Open MichoChan opened 3 months ago
@MichoChan could you please share a command for triggering this error so we can reproduce? Is this some model that didn't work for you?
@MichoChan I believe this issue is fixed on current main by https://github.com/vllm-project/vllm/pull/7264
I know. When I use AutoAWQ with zero_point=True (the GEMM version), vLLM converts the AWQ GEMM checkpoint to the AWQ Marlin version, and that works fine. But when I quantize with AutoAWQ using the Marlin version and no zero point, vLLM raises an error, because vLLM only supports AWQ Marlin with zero points.
Can you point me to a model checkpoint without zero point?
Sorry, I don't have a checkpoint without zero point that you can get from the Hub or another public site.
I also noticed that when AutoAWQ quantizes with the Marlin version, it already saves the model in the Marlin format, whereas vLLM only supports the standard AWQ format, which it then automatically converts to the Marlin format and runs with the Marlin kernel.
So is it correct to say that vLLM only supports the standard AWQ format and converts it to the Marlin format at runtime?
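To illustrate the conversion path being discussed, here is a minimal sketch using vLLM's Python API; the checkpoint path is hypothetical and assumes a standard AWQ (GEMM, zero_point=True) checkpoint:

# A minimal sketch of the runtime conversion described above, assuming a
# standard AWQ (GEMM, zero_point=True) checkpoint at a hypothetical path.
from vllm import LLM

# Leaving `quantization` unset lets vLLM read quantization_config from
# config.json; on GPUs with Marlin support it converts the AWQ GEMM weights
# to the Marlin layout at load time and runs the awq_marlin kernels.
llm = LLM(model="/path/to/awq-gemm-checkpoint")

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)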
+1 here, I've been trying to get this going. First, here is my quantize.py file for AutoAWQ:
model_path = '/mnt/g/stable-code-instruct-3b'
quant_path = '/home/admin/stable_code_marlin'
quant_config = { "zero_point": False, # To use Marlin, you must specify zero point as False and version as Marlin.
The comment is taken directly from the AutoAWQ examples (link).
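For context, here is a minimal end-to-end sketch of the quantization step, following AutoAWQ's documented Marlin example; the paths are the ones quoted above, while q_group_size and w_bit are assumed values (AutoAWQ's examples use 128 and 4):

# Sketch of a full quantize.py for the Marlin format, per AutoAWQ's example.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = '/mnt/g/stable-code-instruct-3b'
quant_path = '/home/admin/stable_code_marlin'
quant_config = {
    "zero_point": False,   # Marlin requires symmetric quantization (no zero point)
    "q_group_size": 128,   # assumed group size
    "w_bit": 4,            # assumed 4-bit weights
    "version": "Marlin",   # save in the Marlin packing format
}

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize and save the Marlin-format checkpoint
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)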
So that's how I'm quantizing the model. Then I call vLLM from the CLI like so: "vllm serve . --port 9000 --trust-remote-code --quantization awq_marlin --cpu-offload-gb 50 --device auto"
It terminates with this error: "ValueError: Marlin does not support weight_bits = uint4. Only types = [ScalarType.uint4b8, ScalarType.uint8b128] are supported (for group_size = 128, device_capability = 89, zp = False)."
Also, in order to get this far, I had to manually edit the config.json file. AutoAWQ generates the config.json with "quant_method": "awq", yet vLLM is expecting "quant_method": "marlin".
In the end you have to manually change it to "awq_marlin". Can the vLLM code be updated to accept "awq" as the quant method with "marlin" as the version?
This is what the config looks like from AutoAWQ:
"quant_method": "awq",
"version": "marlin",
+1, I also got this error. I use SGLang to launch an awq_marlin-quantized model but got an error; details: https://github.com/sgl-project/sglang/issues/1792. From reading the code, it appears vLLM does not support awq_marlin-quantized models with zero_point = false.
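For reference, here is an illustrative sketch, not vLLM's actual source, of the check behind the ValueError quoted earlier; the supported type names for the zero-point-free case are taken from that error message, and the zero-point case is an assumption:

# Illustrative sketch (not vLLM's real code) of why zero_point=False fails:
# AWQ weights map to a plain unsigned type, while Marlin without a zero point
# only accepts the biased types listed in the error message above.

AWQ_TYPE_MAP = {4: "uint4", 8: "uint8"}  # assumed mapping for AWQ weights

def marlin_supported_types(has_zero_point: bool) -> list[str]:
    # zp=False values come from the error message; the zp=True set is assumed.
    return ["uint4", "uint8"] if has_zero_point else ["uint4b8", "uint8b128"]

def check_awq_marlin(w_bit: int, zero_point: bool) -> None:
    quant_type = AWQ_TYPE_MAP[w_bit]
    supported = marlin_supported_types(zero_point)
    if quant_type not in supported:
        raise ValueError(
            f"Marlin does not support weight_bits = {quant_type}. "
            f"Only types = {supported} are supported (zp = {zero_point})."
        )

check_awq_marlin(4, True)        # passes: AWQ GEMM checkpoint with zero points
try:
    check_awq_marlin(4, False)   # the zero_point=False case reported here
except ValueError as e:
    print(e)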
Can you point me to a model checkpoint without zero point?
There are some models you can find by searching 'awq-marlin' on the HF Hub.
Also, you can quantize any model to the awq_marlin format with AutoAWQ to reproduce this error.
Your current environment
vllm 0.5.4
🐛 Describe the bug
AutoAWQ's Marlin version must be quantized with no zero point, but vLLM's awq_marlin support requires a zero point, so loading such a checkpoint raises an error.