oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

QTIP: Quantization with Trellises and Incoherence Processing #6512

Open latheesan-k opened 3 weeks ago

latheesan-k commented 3 weeks ago

Description

Add support for QTIP quantisation?

QTIP is a weight-only large language model (LLM) quantization method that, according to the paper, achieves a state-of-the-art combination of quantization quality and speed. QTIP uses incoherence processing to make LLM weight matrices approximately i.i.d. Gaussian, and then uses trellis coded quantization (TCQ) to quantize these weights with near-optimal distortion. QTIP solves naive TCQ's inherent slowness by introducing a series of novel compute-based codes for use with the "bitshift trellis."
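
For anyone unfamiliar with the "bitshift trellis" idea, here is a rough toy sketch of how I understand it, not the actual QTIP code. Each trellis state is an L-bit window over the bitstream, and moving to the next weight shifts in K new bits, so decoding is just a sliding window plus a table lookup. The names (`make_codebook`, `encode`, `decode`), the sizes L=8 and K=2, and the random Gaussian lookup table are all my own assumptions for illustration; QTIP replaces the stored table with compute-based codes, which this sketch does not attempt.

```python
import random

L = 8                 # assumed state width in bits (2**L trellis states)
K = 2                 # assumed bits spent per weight
MASK = (1 << L) - 1

def make_codebook(seed=0):
    # Hypothetical stand-in: one random Gaussian value per state.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(1 << L)]

def encode(weights, codebook):
    """Viterbi search for the state path with the lowest squared error."""
    n_states = 1 << L
    cost = [(weights[0] - codebook[s]) ** 2 for s in range(n_states)]
    back = []
    for w in weights[1:]:
        new_cost = [float("inf")] * n_states
        prev = [0] * n_states
        for s in range(n_states):
            for b in range(1 << K):
                # Bitshift transition: drop the K oldest bits, append K new ones.
                t = ((s << K) | b) & MASK
                c = cost[s] + (w - codebook[t]) ** 2
                if c < new_cost[t]:
                    new_cost[t] = c
                    prev[t] = s
        back.append(prev)
        cost = new_cost
    # Trace the best path back from the cheapest final state.
    s = min(range(n_states), key=cost.__getitem__)
    states = [s]
    for prev in reversed(back):
        s = prev[s]
        states.append(s)
    states.reverse()
    # Bitstream: the full first state, then only K new bits per later weight.
    bits = states[0]
    for s in states[1:]:
        bits = (bits << K) | (s & ((1 << K) - 1))
    return bits, states

def decode(bits, n, codebook):
    """Slide an L-bit window over the bitstream, K bits at a time."""
    total = L + K * (n - 1)
    out = []
    for i in range(n):
        shift = total - L - K * i
        out.append(codebook[(bits >> shift) & MASK])
    return out

if __name__ == "__main__":
    rng = random.Random(1)
    w = [rng.gauss(0.0, 1.0) for _ in range(64)]
    cb = make_codebook()
    bits, states = encode(w, cb)
    rec = decode(bits, len(w), cb)
    mse = sum((a - b) ** 2 for a, b in zip(w, rec)) / len(w)
    print(f"bits used: {L + K * (len(w) - 1)}, mse: {mse:.4f}")
```

The point of the structure is that the decoder needs no per-state transition table: the next state is fully determined by shifting the bitstream, which is what makes fast decoding plausible.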

Additional Context

Paper: https://arxiv.org/abs/2406.11235
Implementation: https://github.com/Cornell-RelaxML/qtip
Converted Models: https://huggingface.co/collections/relaxml/qtip-quantized-models-66fa253ad3186746f4b62803