oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Use HuggingFace's Quanto library for KV cache quantization with any Transformers-based loader #6126

Open Interpause opened 2 months ago

Interpause commented 2 months ago

Description

HuggingFace's Quanto has implemented 4-bit and 2-bit KV cache quantization compatible with Transformers. See: https://huggingface.co/blog/kv-cache-quantization
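
From the blog post, enabling it in plain Transformers looks roughly like this (a minimal sketch assuming a recent transformers release with the quanto package installed; the model name and prompt are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=20,
    cache_implementation="quantized",                # use a quantized KV cache
    cache_config={"backend": "quanto", "nbits": 4},  # 4-bit; 2-bit is also supported
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```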

I may open a PR when I have time to experiment.
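
For the webui side, the change would presumably amount to forwarding these two kwargs from a new loader option into the existing `generate()` call. A hypothetical sketch (the helper name and the `generate_params` dict are placeholders, not actual webui code):

```python
def apply_quantized_kv_cache(generate_params: dict, nbits: int = 4) -> dict:
    """Hypothetical helper: merge KV-cache quantization kwargs into the
    parameters that the Transformers loader already passes to model.generate()."""
    generate_params.update(
        cache_implementation="quantized",
        cache_config={"backend": "quanto", "nbits": nbits},
    )
    return generate_params
```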

Interpause commented 1 month ago

It definitely seems possible: https://github.com/Vahe1994/AQLM/issues/85#issuecomment-2194691934

But man, they made me project lead for something at uni, so I'm in a time crunch.