Description
Add support for QTIP quantization?
QTIP is a weight-only large language model (LLM) quantization method that achieves a state-of-the-art combination of quantization quality and speed. QTIP uses incoherence processing to make LLM weight matrices approximately i.i.d. Gaussian, and then uses trellis coded quantization (TCQ) to quantize these weights with near-optimal distortion. QTIP avoids naive TCQ's inherent slowness by introducing a series of novel compute-based codes for use with the "bitshift trellis."
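For context, here is a minimal sketch of the core decoding idea, not QTIP's actual kernels: the "bitshift trellis" treats the decoder state as a sliding L-bit window over the compressed bitstream, so each transition just shifts in k new bits, and a compute-based code maps each state to a value on the fly instead of reading a large codebook from memory. The `state_to_value` function and its constants below are hypothetical stand-ins for QTIP's codes (e.g. "1MAD"), which are similar in spirit.

```python
import numpy as np

def state_to_value(state: int) -> float:
    """Hypothetical compute-based code: map an L-bit trellis state to a
    roughly Gaussian scalar without a lookup table. Hash the state with
    an LCG, then sum the output bytes so the result is approximately
    normal by the central limit theorem."""
    x = (state * 34038481 + 76625530) & 0xFFFFFFFF  # arbitrary LCG constants
    byte_sum = sum((x >> (8 * j)) & 0xFF for j in range(4))
    return (byte_sum - 510.0) / 147.8  # roughly zero mean, unit variance

def decode_bitshift_trellis(bits: np.ndarray, L: int = 16, k: int = 2) -> np.ndarray:
    """Decode a k-bit-per-weight stream with a bitshift trellis: the state
    is a sliding L-bit window over the bitstream, and each step emits one
    reconstructed weight."""
    mask = (1 << L) - 1
    state = 0
    out = []
    for i in range(0, len(bits), k):
        # Shift the next k bits into the window (the "bitshift" transition);
        # consecutive states overlap in L - k bits, which is what keeps the
        # trellis structure cheap to decode.
        for b in bits[i:i + k]:
            state = ((state << 1) | int(b)) & mask
        out.append(state_to_value(state))
    return np.array(out)

# Example: decode 64 weights from a random 128-bit stream at 2 bits/weight.
rng = np.random.default_rng(0)
weights = decode_bitshift_trellis(rng.integers(0, 2, size=128))
```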
Additional Context
Paper: https://arxiv.org/abs/2406.11235
Implementation: https://github.com/Cornell-RelaxML/qtip
Converted models: https://huggingface.co/collections/relaxml/qtip-quantized-models-66fa253ad3186746f4b62803