turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

[PAPER] New quant method with SOTA quality and speed: QTIP #668

Open TyraVex opened 3 weeks ago

TyraVex commented 3 weeks ago

Hello Turboderp,

I believe this could interest you; the paper sounds great. exl2 takes a very different approach to quantization, so I don't expect anything concrete from this, I simply want to share some fresh ideas.

From https://www.reddit.com/r/LocalLLaMA/comments/1ggwrx6/new_quantization_method_qtip_quantization_with/:

New Quantization Method -- QTIP: Quantization with Trellises and Incoherence Processing

Resources:

We're pleased to introduce QTIP, a new LLM quantization algorithm that uses trellis coded quantization and incoherence processing to achieve a state of the art combination of speed and quantization quality.

Paper (NeurIPS 2024 Spotlight): https://arxiv.org/pdf/2406.11235

Codebase + inference kernels: https://github.com/Cornell-RelaxML/qtip

Prequantized models (including 2 Bit 405B Instruct): https://huggingface.co/collections/relaxml/qtip-quantized-models-66fa253ad3186746f4b62803

QTIP achieves significantly better quality than QuIP# while being just as fast. QTIP is also on par with or better than PV-Tuning while being much faster (~2-3x).
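For intuition on the "trellises" part: trellis coded quantization picks, for a whole sequence of values, the minimum-error path through a state machine whose branches emit reproduction levels, so the codeword chosen at each step is constrained by the previous steps. Below is a minimal, hypothetical sketch of generic TCQ using a Viterbi search; the trellis layout and level values are made up for illustration and are not QTIP's actual construction (QTIP additionally uses incoherence processing and much larger effective codebooks).

```python
import numpy as np

def tcq_quantize(x, levels, next_state):
    """Quantize a 1-D sequence x by Viterbi search over a trellis.

    levels[s, b]     : reproduction value emitted on branch b from state s
    next_state[s, b] : state reached by taking branch b from state s
    Returns the reconstructed (quantized) sequence.
    """
    S, B = levels.shape
    n = len(x)
    cost = np.full(S, np.inf)
    cost[0] = 0.0                          # conventionally start in state 0
    back = np.zeros((n, S, 2), dtype=int)  # winning (prev_state, branch)

    # Forward pass: accumulate the cheapest squared error into each state.
    for t in range(n):
        new_cost = np.full(S, np.inf)
        for s in range(S):
            if not np.isfinite(cost[s]):
                continue
            for b in range(B):
                ns = next_state[s, b]
                c = cost[s] + (x[t] - levels[s, b]) ** 2
                if c < new_cost[ns]:
                    new_cost[ns] = c
                    back[t, ns] = (s, b)
        cost = new_cost

    # Backtrack from the cheapest final state to recover the path.
    state = int(np.argmin(cost))
    recon = np.zeros(n)
    for t in range(n - 1, -1, -1):
        s, b = back[t, state]
        recon[t] = levels[s, b]
        state = s
    return recon
```

The key contrast with plain scalar quantization is that each step only has B branches (here 1 bit per weight) but the set of levels reachable at a step depends on the path taken so far, which is what lets a trellis code approach vector-quantizer quality at scalar-quantizer cost.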