oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Add Q6 cache support for ExllamaV2 #6278

Open Lissanro opened 1 month ago

Lissanro commented 1 month ago

Description

I have seen Q6 cache support mentioned many times, by the ExLlama dev and other people:

But I see no way to enable it in text-generation-webui. I tried updating text-generation-webui and switching to the dev branch, but still could not find the option. It would be great if it were added.

Additional Context

While the 4-bit cache can work great, it can also reduce quality or even break some models. The 6-bit cache offers consistently high quality and, according to the test results provided in the first link above, may offer the best quality / memory saving ratio.
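To illustrate the memory side of that trade-off, here is a rough back-of-the-envelope calculation of KV cache size at different bit widths. The model dimensions below (a Llama-style 32-layer model with 8 KV heads of dim 128) are illustrative assumptions, and the figures ignore any per-block scale/zero-point overhead that real cache quantization adds:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bits: int) -> int:
    """Approximate KV cache size: K and V tensors each hold
    layers * kv_heads * head_dim * seq_len elements."""
    elements = 2 * layers * kv_heads * head_dim * seq_len
    return elements * bits // 8

# Illustrative Llama-style model at a 32k context:
args = (32, 8, 128, 32768)
fp16 = kv_cache_bytes(*args, bits=16)  # 4.0 GiB
q6   = kv_cache_bytes(*args, bits=6)   # 1.5 GiB
q4   = kv_cache_bytes(*args, bits=4)   # 1.0 GiB
print(fp16 // 2**20, q6 // 2**20, q4 // 2**20)  # sizes in MiB
```

So Q6 recovers most of the savings of Q4 (62.5% vs. 75% off FP16) while keeping two extra bits of precision per cache element, which is why it is described above as a sweet spot.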

randoentity commented 1 month ago

Coincidentally I was looking into this. Should have a PR up soon after I test it out a bit.

randoentity commented 1 month ago

https://github.com/oobabooga/text-generation-webui/pull/6280