[X] I have looked for similar requests before submitting this one.
[X] I understand that the developers have lives and my issue will be answered when possible.
[X] I understand the developers of this program are human, and I will make my requests politely.
Hello Turboderp,
I believe this could interest you; the paper sounds great. I know exl2 takes a very different approach to quantization, so I don't expect anything concrete from this, I'm simply sharing some fresh ideas.
From https://www.reddit.com/r/LocalLLaMA/comments/1ggwrx6/new_quantization_method_qtip_quantization_with/:
New Quantization Method -- QTIP: Quantization with Trellises and Incoherence Processing
We're pleased to introduce QTIP, a new LLM quantization algorithm that uses trellis-coded quantization and incoherence processing to achieve a state-of-the-art combination of speed and quantization quality.
Paper (NeurIPS 2024 Spotlight): https://arxiv.org/pdf/2406.11235
Codebase + inference kernels: https://github.com/Cornell-RelaxML/qtip
Prequantized models (including 2 Bit 405B Instruct): https://huggingface.co/collections/relaxml/qtip-quantized-models-66fa253ad3186746f4b62803
QTIP has significantly better quality than QuIP# while being just as fast. It is also on par with or better than PV-Tuning while being much faster (~2-3x).
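For a quick intuition of the two ingredients: in trellis-coded quantization, each weight's reconstruction level depends on a small hidden state carried over from the previous weight, so encoding a sequence becomes a Viterbi search for the cheapest path through a trellis, while decoding is just a stateful table walk. Below is a minimal numpy toy of that idea (all names, sizes, and the random trellis are mine for illustration only; the paper's trellises and codebooks are far larger and structured for fast GPU decode):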
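```python
import numpy as np

# Toy trellis-coded quantizer (TCQ), not the QTIP implementation.
# Each state owns a few reconstruction levels; encoding a sequence means
# finding the minimum-squared-error path through the trellis.

NUM_STATES = 4   # illustrative; real trellises are much larger
BRANCHES = 2     # 2 branches per state => 1 bit per weight

rng = np.random.default_rng(0)
levels = rng.standard_normal((NUM_STATES, BRANCHES))               # branch outputs
next_state = rng.integers(0, NUM_STATES, (NUM_STATES, BRANCHES))   # transitions

def tcq_encode(x):
    """Viterbi search: cheapest total squared error path, starting in state 0."""
    T = len(x)
    cost = np.full(NUM_STATES, np.inf)
    cost[0] = 0.0
    back = np.zeros((T, NUM_STATES, 2), dtype=int)  # (prev state, branch taken)
    for t in range(T):
        new_cost = np.full(NUM_STATES, np.inf)
        for s in range(NUM_STATES):
            if np.isinf(cost[s]):
                continue
            for b in range(BRANCHES):
                ns = next_state[s, b]
                c = cost[s] + (x[t] - levels[s, b]) ** 2
                if c < new_cost[ns]:
                    new_cost[ns] = c
                    back[t, ns] = (s, b)
        cost = new_cost
    # Trace back from the cheapest final state to recover the branch bits.
    s = int(np.argmin(cost))
    bits = []
    for t in range(T - 1, -1, -1):
        s, b = back[t, s]
        bits.append(b)
    return bits[::-1]

def tcq_decode(bits):
    """Decoding replays the branches: cheap, which is the point at inference."""
    s, out = 0, []
    for b in bits:
        out.append(levels[s, b])
        s = next_state[s, b]
    return np.array(out)

x = rng.standard_normal(32)
x_hat = tcq_decode(tcq_encode(x))
print("per-weight MSE:", np.mean((x - x_hat) ** 2))
```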
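Incoherence processing, as I understand it from QuIP# and this paper, rotates the weight matrix on both sides with random orthogonal transforms (randomized Hadamard) so its entries look roughly i.i.d. Gaussian before quantization, then undoes the rotation exactly after dequantization. A rough sketch, assuming power-of-two dimensions for scipy's `hadamard` (the helper names are mine):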
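```python
import numpy as np
from scipy.linalg import hadamard

def random_hadamard(n, rng):
    """Randomized Hadamard transform: (H / sqrt(n)) times random sign flips.
    Orthogonal, so the rotation can be inverted exactly."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return (hadamard(n) / np.sqrt(n)) * signs  # flips the sign of each column

def incohere(W, rng):
    """Rotate W on both sides so its entries look approximately Gaussian."""
    m, n = W.shape
    U = random_hadamard(m, rng)
    V = random_hadamard(n, rng)
    return U @ W @ V.T, U, V

def decohere(W_hat, U, V):
    """Undo the rotations on the dequantized matrix."""
    return U.T @ W_hat @ V

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)) * np.linspace(0.1, 10, 64)  # badly scaled columns
W_rot, U, V = incohere(W, rng)
# ... quantize W_rot here (e.g. with the trellis coder above) ...
W_back = decohere(W_rot, U, V)
assert np.allclose(W, W_back)  # round trip is exact up to float error
```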