oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Add Q4/Q8 cache for llama.cpp #6168

GodEmperor785 commented 2 weeks ago

Description

For some time now there has been an option to use a Q8 or Q4 KV cache in llama.cpp. It is present, for example, in KoboldCPP and works great there. Using a quantized KV cache reduces the VRAM required to run GGUF models. Similar options already exist for exllamav2 and work great.

Perhaps it would be possible to add the "cache_8bit" and "cache_4bit" options to the llamacpp loaders as well, so that users can use a quantized KV cache like in KoboldCPP?

Additional Context

Below are my notes from trying to add this. I'm not really familiar with the webui codebase, but I tried adding the options in a quick way; I only managed to get the Q8 cache to work.

From what I can see, the Llama class in llama-cpp-python has these two parameters: type_k: Optional[int] = None and type_v: Optional[int] = None.

Setting these to a value like 8 seems to just work and reduces the VRAM used by the KV cache by half. I managed to use this in the webui by setting params["type_k"] = 8 and params["type_v"] = 8 inside the from_pretrained method in modules/llamacpp_model.py. For this to work the flash_attn option is required, or else it throws errors, so perhaps a warning for users should be added too. I tested it briefly with a few GGUF models, and all of them worked without any problems.
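For reference, here is a minimal sketch of the same thing outside the webui, calling llama-cpp-python directly (the model path is a placeholder, and the flash_attn and n_gpu_layers arguments are assumptions based on the current llama-cpp-python API):

```python
from llama_cpp import Llama

# Minimal sketch, not the webui code: load a GGUF with a Q8_0 KV cache.
# 8 is the ggml_type value for Q8_0; flash_attn must be enabled, otherwise
# llama.cpp errors out when the cache is quantized.
llm = Llama(
    model_path="models/some-model.gguf",  # placeholder path
    n_gpu_layers=-1,                      # offload all layers (optional)
    flash_attn=True,                      # required for a quantized KV cache
    type_k=8,                             # K cache stored as Q8_0
    type_v=8,                             # V cache stored as Q8_0
)
```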

However, for the Q4 cache (params["type_X"] = 4) it didn't work, so most likely some more work is needed here, and I don't know enough to figure out why it fails. I'm also not sure about the llamacpp_HF loader, as I have never used it.

oobabooga commented 2 weeks ago

What are the most common combinations of these parameters?

GodEmperor785 commented 2 weeks ago

I did some digging in the koboldcpp and llama.cpp code to figure out what values they accept, since for some reason type_k/type_v = 4 fails with strange errors.

In the llama.cpp options I found this: "KV cache data type for K (default: f16, options f32, f16, q8_0, q4_0, q4_1, iq4_nl, q5_0, or q5_1)"

But type_k/type_v in the Llama class expect an int, so we can't just write "q4_0". So I checked how koboldcpp does it and found that it uses GGML_TYPE_Q4_0, GGML_TYPE_Q8_0 and GGML_TYPE_F16. These are defined in the ggml_type enum in ggml.h. GGML_TYPE_Q8_0 just happens to have the value 8, so passing 8 as type_k/type_v gives the Q8 cache, but GGML_TYPE_Q4_0 has the value 2.

So koboldcpp uses 8 and 2 for its cache, and llama.cpp itself seems to support: 0 (f32), 1 (f16, default), 2 (Q4_0), 3 (Q4_1), 6 (Q5_0), 7 (Q5_1), 8 (Q8_0) and 20 (IQ4_NL). That is a lot of options, and apparently they can even be mixed, according to GitHub issues in llama.cpp.
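Put as a small Python mapping (the dict name is just for illustration; the values are the ggml_type enum values from ggml.h listed above):

```python
# Integer values accepted by type_k / type_v, per the ggml_type enum in ggml.h
KV_CACHE_TYPES = {
    "f32": 0,
    "f16": 1,     # default KV cache type
    "q4_0": 2,
    "q4_1": 3,
    "q5_0": 6,
    "q5_1": 7,
    "q8_0": 8,
    "iq4_nl": 20,
}
```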

I think the webui should allow at least the Q8 and Q4 caches (like exllamav2 and koboldcpp), so that would be: for Q8, params["type_k"] = 8 and params["type_v"] = 8; for Q4, params["type_k"] = 2 and params["type_v"] = 2. If these params remain unset, the cache defaults to fp16 (the current behavior). I tried hardcoding those two values, loading a GGUF (a llama3 8B finetune), and checking that the model stays coherent on a simple question; both settings worked fine. Tests were done on an NVIDIA GPU, both with all layers offloaded and with none. And as I mentioned before, both of these require flash_attn to be checked too.
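A rough sketch of how that could look inside from_pretrained in modules/llamacpp_model.py, assuming the existing exllamav2 flag names (cache_8bit / cache_4bit) are reused; this is not the final implementation:

```python
from modules import shared  # webui's shared command-line args

# Sketch only: map the shared cache flags to llama.cpp KV cache types.
# Leaving both flags off keeps the default f16 cache (current behavior).
if shared.args.cache_8bit:
    params["type_k"] = 8  # GGML_TYPE_Q8_0
    params["type_v"] = 8
elif shared.args.cache_4bit:
    params["type_k"] = 2  # GGML_TYPE_Q4_0
    params["type_v"] = 2
```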

What do you think about this?

oobabooga commented 2 weeks ago

That's exactly the information I needed. Thanks for digging into this.

Based on what you found, I have reused the exllamav2 --cache_4bit and --cache_8bit options for llama.cpp in this commit: https://github.com/oobabooga/text-generation-webui/commit/4ea260098f297900853912eeb9e13c0381eb483d

It's in the dev branch now. It seems to work well.