vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: AutoAWQ marlin methods error #7517

Open MichoChan opened 3 months ago

MichoChan commented 3 months ago

Your current environment

vllm 0.5.4

🐛 Describe the bug

AutoAWQ's Marlin version requires zero_point to be False, but vLLM has:

# From vllm/model_executor/layers/quantization/utils/marlin_utils.py
from typing import Optional

from vllm.platforms import current_platform
from vllm.scalar_type import scalar_types


def query_marlin_supported_quant_types(has_zp: bool,
                                       min_capability: Optional[int] = None):
    if min_capability is None:
        major, minor = current_platform.get_device_capability()
        min_capability = major * 10 + minor

    if min_capability < 80:
        return []

    if has_zp:
        # AWQ style, unsigned + runtime zero-point
        return [scalar_types.uint4, scalar_types.uint8]
    else:
        # GPTQ style, unsigned + symmetric bias
        # TODO: once fp8_marlin is merged into "gptq_marlin" we should be able
        #  to add `scalar_types.float8_e4m3fn` here
        return [scalar_types.uint4b8, scalar_types.uint8b128]

This raises an error.
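
For illustration, here is a minimal, self-contained sketch of why the check fails for an AutoAWQ Marlin checkpoint (plain strings stand in for vllm's scalar_types objects, and device_capability=89 is just an example value, not the real call site):

# Simplified stand-in for the helper quoted above; strings replace scalar_types.
def marlin_supported_quant_types(has_zp: bool, device_capability: int = 89):
    if device_capability < 80:
        return []
    if has_zp:
        # AWQ style: unsigned weights + runtime zero-point
        return ["uint4", "uint8"]
    # GPTQ style: unsigned weights + symmetric bias, no zero-point
    return ["uint4b8", "uint8b128"]

# An AutoAWQ Marlin checkpoint stores unsigned 4-bit weights with zero_point=False,
# so the loader consults the has_zp=False list ...
supported = marlin_supported_quant_types(has_zp=False)

# ... but the checkpoint's weight type is plain uint4, which only appears in the
# has_zp=True list, so the compatibility check fails.
print("uint4" in supported)  # False -> "Marlin does not support weight_bits = uint4"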

mgoin commented 3 months ago

@MichoChan could you please share a command for triggering this error so we can reproduce? Is this some model that didn't work for you?

robertgshaw2-neuralmagic commented 3 months ago

@MichoChan I believe this issue is fixed on current main by https://github.com/vllm-project/vllm/pull/7264

MichoChan commented 3 months ago

@MichoChan I believe this issue is fixed on current main by #7264

I know. When I use AutoAWQ with zero_point=True (the GEMM version), vLLM converts the AWQ GEMM checkpoint to AWQ Marlin, and that looks fine. But when I quantize with AutoAWQ using the Marlin version and no zero point, vLLM raises an error, because vLLM only supports AWQ Marlin with a zero point.

robertgshaw2-neuralmagic commented 3 months ago

Can you point me to a model checkpoint without zero point?

MichoChan commented 3 months ago

Can you point me to a model checkpoint without zero point?

Sorry, I have no model checkpoint without a zero point that you can get from the Hub or another public site.

Also, I notice that AutoAWQ's Marlin quantization already saves the model in the Marlin format, whereas vLLM only supports the normal AWQ format, which it then automatically converts to the Marlin format and runs with the Marlin kernel.

So can I say that vLLM only supports the normal AWQ format and converts it to the Marlin format at runtime?
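
For reference, the "quantization_config" block that AutoAWQ writes into config.json differs between the two cases. A rough sketch as Python dicts (field values inferred from this thread, not copied from a real checkpoint):

# GEMM checkpoint: vLLM reads quant_method "awq" and converts the weights
# to the Marlin layout at load time.
gemm_quant_config = {
    "quant_method": "awq",
    "version": "gemm",
    "bits": 4,
    "group_size": 128,
    "zero_point": True,
}

# Marlin checkpoint: the weights are already stored in the Marlin layout and
# zero_point is False, which the awq/awq_marlin loaders reject.
marlin_quant_config = {
    "quant_method": "awq",
    "version": "marlin",
    "bits": 4,
    "group_size": 128,
    "zero_point": False,
}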

ColumbusAI commented 3 months ago

+1 here. I've been trying to get this going. First, here is my quantize.py file for AutoAWQ:

model_path = '/mnt/g/stable-code-instruct-3b'
quant_path = '/home/admin/stable_code_marlin'

# To use Marlin, you must specify zero point as False and version as Marlin.
quant_config = {"zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin"}

The comment is taken directly from the AutoAWQ examples (link).
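
For context, the rest of a quantize.py like this follows the standard AutoAWQ flow; a sketch using the documented AutoAWQ API, continuing from the snippet above (these lines are not from the original report):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load the fp16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate and quantize with the Marlin config above, then save the
# Marlin-format checkpoint and tokenizer to quant_path.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)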

So that's how I'm quantizing the model. Then I call vLLM from the CLI like so: "vllm serve . --port 9000 --trust-remote-code --quantization awq_marlin --cpu-offload-gb 50 --device auto"

It terminates with this error: "ValueError: Marlin does not support weight_bits = uint4. Only types = [ScalarType.uint4b8, ScalarType.uint8b128] are supported (for group_size = 128, device_capability = 89, zp = False)."

Also, in order to get this far, I had to manually change the config.json file. AutoAWQ generates the config.json with "quant_method": "awq", yet vLLM is expecting "quant_method": "marlin".

In the end you have to manually change it to "awq_marlin". Can the vLLM code be updated to accept "awq" as the quant method and "marlin" as the version?

This is what the config looks like from AutoAWQ:

"quant_method": "awq",
"version": "marlin",

liangzelang commented 1 month ago

+1, I also got this error. I used SGLang to launch an awq_marlin-quantized model but got an error; details here: https://github.com/sgl-project/sglang/issues/1792. Through code analysis, I found that vLLM does not support awq_marlin-quantized models with zero_point = false.

liangzelang commented 1 month ago

Can you point me to a model checkpoint without zero point?

There are some models available if you search for 'awq-marlin' on the Hugging Face Hub.

Also, you can quantize any model to the awq_marlin format with AutoAWQ to reproduce this error.
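
For example, a minimal repro sketch using the vLLM Python API (the checkpoint path is a placeholder for any AutoAWQ Marlin, zero_point=False checkpoint):

from vllm import LLM

# Loading an AutoAWQ Marlin checkpoint hits the same compatibility check and raises:
# ValueError: Marlin does not support weight_bits = uint4. Only types =
# [ScalarType.uint4b8, ScalarType.uint8b128] are supported ...
llm = LLM(model="/path/to/awq-marlin-checkpoint", quantization="awq_marlin")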