rghosh08 opened 4 months ago
Try GPTQ? We support 2/3/4/8 bits.
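For reference, a minimal sketch of serving a pre-quantized GPTQ checkpoint with vLLM's OpenAI-compatible server; the model ID below is only a placeholder for any Hugging Face repo that ships GPTQ weights:

```bash
# Sketch: serve a GPTQ-quantized checkpoint with vLLM.
# The model ID is a placeholder; any HF repo with GPTQ weights should work.
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-7B-Chat-GPTQ \
    --quantization gptq
```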
@simon-mo is it possible to support eetq, like huggingface/text-generation-inference?
https://github.com/NetEase-FuXi/EETQ
It's super useful because you don't even need an offline quantization step: you just point it at a normal unquantized model, pass `--quantize eetq`, and then you magically use half the VRAM and get very fast inference with very little quality impact.
Here's the PR where they added it in TGI: https://github.com/huggingface/text-generation-inference/pull/1068/files
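For concreteness, here is roughly what that looks like on the TGI side (a sketch only, with a placeholder model ID): the checkpoint stays in plain fp16 and EETQ quantizes the weights to int8 at load time.

```bash
# Sketch: launch TGI on an ordinary unquantized fp16 checkpoint and let EETQ
# quantize the weights on the fly (no offline quantization step needed).
# The model ID is a placeholder.
text-generation-launcher \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --quantize eetq
```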
Good idea. Would it also be possible to integrate the W4A16 kernel optimization from TensorRT-LLM?
That's a good idea. EETQ works out of the box and we'd like to integrate it into vLLM.
Does vLLM support 8-bit quantization? We need to use vLLM with a large context window (>1K tokens). We tried AWQ, but the generation quality is not good. Any pointers would be greatly appreciated.