turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

[PAPER] New quant method with SOTA quality and speed: QTIP #668

Open TyraVex opened 3 weeks ago

TyraVex commented 3 weeks ago

Hello Turboderp,

I believe this could interest you; the paper sounds great. exl2 takes a very different approach to quantization, so I don't expect anything concrete from this, I simply want to share some fresh ideas.

From https://www.reddit.com/r/LocalLLaMA/comments/1ggwrx6/new_quantization_method_qtip_quantization_with/:

New Quantization Method -- QTIP: Quantization with Trellises and Incoherence Processing

Resources:

We're pleased to introduce QTIP, a new LLM quantization algorithm that uses trellis coded quantization and incoherence processing to achieve a state of the art combination of speed and quantization quality.

Paper (NeurIPS 2024 Spotlight): https://arxiv.org/pdf/2406.11235

Codebase + inference kernels: https://github.com/Cornell-RelaxML/qtip

Prequantized models (including 2 Bit 405B Instruct): https://huggingface.co/collections/relaxml/qtip-quantized-models-66fa253ad3186746f4b62803

QTIP achieves significantly better quality than QuIP# while being just as fast. QTIP is also on par with or better than PV-Tuning while being much faster (~2-3x).
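For intuition on the "trellises" part: trellis coded quantization picks, for a whole sequence of values, the minimum-error path through a state machine whose branches emit reproduction levels, so the codeword chosen at each step is constrained by the previous steps. Below is a minimal, hypothetical sketch of generic TCQ using a Viterbi search; the trellis layout and level values are made up for illustration and are not QTIP's actual construction (QTIP additionally uses incoherence processing and much larger effective codebooks).

```python
import numpy as np

def tcq_quantize(x, levels, next_state):
    """Quantize a 1-D sequence x by Viterbi search over a trellis.

    levels[s, b]     : reproduction value emitted on branch b from state s
    next_state[s, b] : state reached by taking branch b from state s
    Returns the reconstructed (quantized) sequence.
    """
    S, B = levels.shape
    n = len(x)
    cost = np.full(S, np.inf)
    cost[0] = 0.0                          # conventionally start in state 0
    back = np.zeros((n, S, 2), dtype=int)  # winning (prev_state, branch)

    # Forward pass: accumulate the cheapest squared error into each state.
    for t in range(n):
        new_cost = np.full(S, np.inf)
        for s in range(S):
            if not np.isfinite(cost[s]):
                continue
            for b in range(B):
                ns = next_state[s, b]
                c = cost[s] + (x[t] - levels[s, b]) ** 2
                if c < new_cost[ns]:
                    new_cost[ns] = c
                    back[t, ns] = (s, b)
        cost = new_cost

    # Backtrack from the cheapest final state to recover the path.
    state = int(np.argmin(cost))
    recon = np.zeros(n)
    for t in range(n - 1, -1, -1):
        s, b = back[t, state]
        recon[t] = levels[s, b]
        state = s
    return recon
```

The key contrast with plain scalar quantization is that each step only has B branches (here 1 bit per weight) but the set of levels reachable at a step depends on the path taken so far, which is what lets a trellis code approach vector-quantizer quality at scalar-quantizer cost.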