vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: How to initialize gemma2-27b with 4-bit quantization? #6068

Open · maxin9966 opened this issue 2 months ago

maxin9966 commented 2 months ago

Your current environment

How to initialize gemma2-27b with 4-bit quantization?

How would you like to use vllm

Could you please explain how to initialize gemma2-27b with 4-bit quantization?

Qubitium commented 2 months ago

GPTQModel v0.9.3 added Gemma 2 support for GPTQ 4-bit quantization, but the 27B model has inference issues; we haven't had time to test whether vLLM shows the same inference issue with the 27B model as HF Transformers does. The 9B model is perfect, though, and passes with flying colors.

You can try quantizing the 27B with GPTQModel (use format=FORMAT.GPTQ, sym=True) and then try running inference with vLLM. Let me know if you get it working.
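Roughly, the flow would look like the sketch below. Treat it as an outline rather than a verified recipe: the GPTQModel call names follow the v0.9.x README, and the model path, output directory, and calibration texts are placeholders.

```python
# Sketch: 4-bit GPTQ quantization of Gemma 2 27B with GPTQModel, then inference
# with vLLM. API names follow the GPTQModel v0.9.x README; verify against your
# installed version. Paths and calibration texts are placeholders.
from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel.quantization import FORMAT
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "google/gemma-2-27b-it"     # placeholder: any Gemma 2 27B checkpoint
out_dir = "gemma-2-27b-it-gptq-4bit"   # placeholder output directory

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    format=FORMAT.GPTQ,  # as suggested above
    sym=True,
)

# A tiny calibration set for illustration; use a few hundred samples in practice.
tokenizer = AutoTokenizer.from_pretrained(model_id)
calibration = [
    tokenizer("vLLM is a high-throughput inference engine for LLMs."),
    tokenizer("Gemma 2 is a family of open-weight models."),
]

model = GPTQModel.from_pretrained(model_id, quant_config)
model.quantize(calibration)
model.save_quantized(out_dir)

# Load the quantized checkpoint with vLLM (quantization is usually auto-detected,
# but it can be passed explicitly).
llm = LLM(model=out_dir, quantization="gptq")
result = llm.generate(["Hello, Gemma!"], SamplingParams(max_tokens=32))
print(result[0].outputs[0].text)
```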

SJY8460 commented 2 months ago

I have a similar question: can vLLM directly use "load_in_4bit" to load a quantized model? If not, will it be implemented in the future?
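For what it's worth, vLLM does not accept Transformers' load_in_4bit flag directly; the closest path I know of is its bitsandbytes load format, sketched below. Whether that path exists in your installed version and covers Gemma 2 is an assumption to check, not something confirmed in this thread.

```python
# Sketch: in-flight bitsandbytes 4-bit loading in vLLM, the closest analogue to
# Transformers' load_in_4bit that I'm aware of. Availability and Gemma 2 support
# depend on the installed vLLM version; treat this as an assumption to verify.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-2-27b-it",   # placeholder checkpoint
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)

out = llm.generate(["Explain paged attention in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```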

yechenzhi commented 1 month ago

Hi, same question here~