vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: How to run the int4 quantized version of the gemma2-27b model #7125

Open maxin9966 opened 1 month ago

maxin9966 commented 1 month ago

🚀 The feature, motivation and pitch

How do I run an int4 quantized version of the gemma-2-27b model with vLLM?

Alternatives

No response

Additional context

No response
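For reference, the offline-API equivalent of running such a checkpoint looks roughly like the sketch below. The checkpoint name is the GPTQ model tried later in this thread, and the prompt and sampling settings are placeholders, not part of the original report.

```python
from vllm import LLM, SamplingParams

# Rough sketch: load a 4-bit GPTQ checkpoint with vLLM's offline API.
# vLLM normally detects the quantization method from the model config,
# but it can also be forced with quantization="gptq".
llm = LLM(
    model="ModelCloud/gemma-2-27b-it-gptq-4bit",
    quantization="gptq",
    dtype="float16",
    gpu_memory_utilization=0.9,
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain int4 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```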

mgoin commented 1 month ago

Can you share the failure? It should work AFAIK

maxin9966 commented 1 month ago

@mgoin Running gemma-2-27b-gptq through vLLM produces nothing but repeated pad tokens in the output. I have tested different versions of FlashInfer and vLLM and the results are the same. Could you tell me how to run a 4-bit quantized version of gemma-2-27b-it?

vllm: VLLM_ATTENTION_BACKEND=FLASHINFER CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server --model ModelCloud/gemma-2-27b-it-gptq-4bit --gpu-memory-utilization 0.9 --quantization gptq --host 0.0.0.0 --port 1231 -tp 1 --dtype float16 --served-model-name gpt --trust-remote-code --enable-prefix-caching --enforce-eager

output: [Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='

(screenshot of the truncated response attached)
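For reference, the request behind this output is equivalent to the client sketch below (it assumes the server above is reachable on localhost at port 1231 and serves the model under the name gpt; the prompt is just an example):

```python
from openai import OpenAI

# Minimal client sketch against the vLLM OpenAI-compatible server started above
# (assumes it is reachable on localhost:1231 and serves the model as "gpt").
client = OpenAI(base_url="http://localhost:1231/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="gpt",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=64,
)
# With this checkpoint the returned content is reportedly a run of "pad" tokens.
print(resp.choices[0].message.content)
```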

mgoin commented 1 month ago

Do you have a reference to that model working in Transformers or some other inference engine? Maybe this is worth reporting as an issue on that specific model card.
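One quick way to get such a reference would be to load the checkpoint directly in Transformers, roughly as in the sketch below (it assumes a GPTQ-capable backend such as optimum with auto-gptq or GPTQModel is installed; the prompt is arbitrary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sanity-check sketch: run the same GPTQ checkpoint directly in Transformers.
# Requires a GPTQ backend (e.g. optimum + auto-gptq/gptqmodel) to be installed.
model_id = "ModelCloud/gemma-2-27b-it-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # place the quantized weights on the available GPU(s)
    torch_dtype="auto",  # keep the dtype recorded in the checkpoint config
)

inputs = tokenizer("Write one sentence about GPUs.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

If this also produces repeated pad tokens, the problem is in the checkpoint itself; if it produces sensible text, that points at vLLM's GPTQ or Gemma-2 path.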

maxin9966 commented 1 month ago

@mgoin I can't find any other int4 quantized versions on HF; everything else on HF is in GGUF format. Where can I find an int4 version?

mgoin commented 1 month ago

I don't know; it is up to the community to produce an int4 checkpoint. My point is that we don't know whether the output you reported is normal for that checkpoint or specific to vLLM. It says it was made with GPTQModel: https://github.com/ModelCloud/GPTQModel
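For anyone who wants to produce their own int4 checkpoint, the GPTQModel flow looks roughly like the sketch below. The exact entry points have changed between GPTQModel releases (older versions use from_pretrained/save_quantized), so treat this as an outline rather than a recipe; the calibration texts and output path are placeholders.

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Rough outline of producing a 4-bit GPTQ checkpoint with GPTQModel.
# Method names vary by GPTQModel version; calibration texts and the output
# path below are placeholders, not a real calibration set.
quant_config = QuantizeConfig(bits=4, group_size=128)

calibration_texts = [
    "vLLM is a high-throughput inference engine for LLMs.",
    "Quantization reduces model memory footprint at some cost in accuracy.",
    # ... a real run would use a few hundred representative samples
]

model = GPTQModel.load("google/gemma-2-27b-it", quant_config)
model.quantize(calibration_texts)
model.save("gemma-2-27b-it-gptq-4bit")
```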