
8bit quantization #3261

Open rghosh08 opened 4 months ago

rghosh08 commented 4 months ago

Does vLLM support 8-bit quantization? We need to use vLLM with a large context window (>1K tokens). We tried AWQ, but the generation quality is not good. Any pointers would be greatly appreciated.

simon-mo commented 4 months ago

Try GPTQ? We support 2/3/4/8 bits.
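A minimal sketch of the GPTQ path from vLLM's Python API, assuming a pre-quantized GPTQ checkpoint from the Hub (the model ID and generation settings below are just illustrative):

```python
from vllm import LLM, SamplingParams

# Load a GPTQ-quantized checkpoint; the model ID is an example of a pre-quantized
# model from the Hub, not a specific recommendation.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    quantization="gptq",   # use vLLM's GPTQ kernels
    max_model_len=4096,    # comfortably above the >1K-token contexts mentioned above
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain weight-only quantization in two sentences."], sampling)
print(outputs[0].outputs[0].text)
```

The same option is available when launching the OpenAI-compatible server via `--quantization gptq`.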

andysalerno commented 4 months ago

> Try GPTQ? We support 2/3/4/8 bits.

@simon-mo is it possible to support EETQ, like huggingface/text-generation-inference does?

https://github.com/NetEase-FuXi/EETQ

It's super useful because you don't even need an offline quantization step: you just point it at a normal unquantized model, pass --quantize eetq, and you use half the VRAM and get very fast inference with very little quality impact.

Here's the PR where they added it in TGI: https://github.com/huggingface/text-generation-inference/pull/1068/files
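For reference, the TGI usage being described is roughly the following; the container tag, model ID, and port are placeholders, and only the `--quantize eetq` flag comes from the linked PR:

```bash
# Serve an ordinary fp16 checkpoint; EETQ quantizes the weights at load time,
# so no offline quantization step is needed. Model ID and ports are placeholders.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 \
  --quantize eetq
```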

shiqingzhangCSU commented 3 months ago

> > Try GPTQ? We support 2/3/4/8 bits.
>
> @simon-mo is it possible to support EETQ, like huggingface/text-generation-inference does?
>
> https://github.com/NetEase-FuXi/EETQ
>
> It's super useful because you don't even need an offline quantization step: you just point it at a normal unquantized model, pass --quantize eetq, and you use half the VRAM and get very fast inference with very little quality impact.
>
> Here's the PR where they added it in TGI: https://github.com/huggingface/text-generation-inference/pull/1068/files

Good idea. Is it possible to also integrate the W4A16 kernel optimization from TensorRT-LLM?

SidaZh commented 3 months ago

That's a good idea. EETQ works out of the box and we'd like to integrate it into vLLM.
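From the user's side, the appeal is that such an integration could presumably mirror vLLM's existing quantization options. The sketch below is purely hypothetical: "eetq" is not an accepted quantization value in vLLM as of this thread.

```python
from vllm import LLM

# HYPOTHETICAL sketch only: "eetq" is not a supported quantization value in vLLM today.
# The draw of EETQ is that the checkpoint stays a normal fp16/bf16 model and the
# weights would be quantized on the fly at load time.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # ordinary, unquantized checkpoint
    quantization="eetq",                         # hypothetical value, not implemented
)
```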