GodEmperor785 opened 4 months ago
What are the most common combinations of these parameters?
I did some digging in koboldcpp and llamacpp code to figure out what values they accept, as for some reason type_k/type_v = 4 fails with strange errors.
In llamacpp options I found this: "KV cache data type for K (default: f16, options f32, f16, q8_0, q4_0, q4_1, iq4_nl, q5_0, or q5_1)"
But type_k/type_v in the Llama class expect an int, so we can't just write "q4_0". So I checked how they did it in koboldcpp and found they use GGML_TYPE_Q4_0, GGML_TYPE_Q8_0 and GGML_TYPE_F16. These are defined in the ggml_type enum in ggml.h. GGML_TYPE_Q8_0 just happens to have value 8, so passing 8 as type_k/type_v gives a Q8 cache, but GGML_TYPE_Q4_0 has value 2.
So koboldcpp uses 8 and 2 for its cache. And llamacpp itself seems to support: 0 (f32), 1 (f16, default), 8 (Q8_0), 2 (Q4_0), 3 (Q4_1), 20 (IQ4_NL), 6 (Q5_0) and 7 (Q5_1). That is a lot of options, and according to GitHub issues in llamacpp they can apparently even be mixed.
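For reference, here is that mapping written out as a Python dict, using the ggml_type values listed above (the dict name is just for illustration, not an existing API):

```python
# Integer values of the ggml_type enum (from ggml.h) that llama.cpp accepts
# for the KV cache; type_k/type_v in the Llama class take these ints.
KV_CACHE_TYPES = {
    "f32": 0,      # GGML_TYPE_F32
    "f16": 1,      # GGML_TYPE_F16 (default)
    "q4_0": 2,     # GGML_TYPE_Q4_0
    "q4_1": 3,     # GGML_TYPE_Q4_1
    "q5_0": 6,     # GGML_TYPE_Q5_0
    "q5_1": 7,     # GGML_TYPE_Q5_1
    "q8_0": 8,     # GGML_TYPE_Q8_0
    "iq4_nl": 20,  # GGML_TYPE_IQ4_NL
}
```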
I think the webui should allow at least the Q8 and Q4 cache (like exllamav2 and koboldcpp). That would be, for Q8: params["type_k"] = 8 and params["type_v"] = 8, and for Q4: params["type_k"] = 2 and params["type_v"] = 2. If these params remain unset, it defaults to fp16 (the current behavior). I tried hardcoding those two values, loaded a GGUF (a llama3 8B finetune), and tested that the model stays coherent on a simple question; both worked fine. Tests were done on an NVIDIA GPU, both with all layers offloaded and with none. And as I mentioned before, both of these require flash_attn to be checked too.
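As a rough sketch (not tested beyond the hardcoded version above), the wiring inside from_pretrained could look something like this, reusing the cache_8bit/cache_4bit flag names that already exist for exllamav2:

```python
# Sketch only: map the (assumed) webui flags to the ggml_type ints above.
if shared.args.cache_8bit:
    params["type_k"] = 8  # GGML_TYPE_Q8_0
    params["type_v"] = 8
elif shared.args.cache_4bit:
    params["type_k"] = 2  # GGML_TYPE_Q4_0
    params["type_v"] = 2
# Leaving both unset keeps the current f16 default (GGML_TYPE_F16 = 1).
```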
What do you think about this?
That's exactly the information I needed. Thanks for digging into this.
Based on what you found, I have reused the exllamav2 --cache_4bit and --cache_8bit options for llama.cpp in this commit: https://github.com/oobabooga/text-generation-webui/commit/4ea260098f297900853912eeb9e13c0381eb483d
It's in the dev branch now. It seems to work well.
There are more options than just all-4-bit or all-8-bit. Unfortunately, to make them work you have to compile llama.cpp with more kernels. The K cache is more sensitive and may cause issues, so I have been running Q8 for K and Q4 for V: https://github.com/ggerganov/llama.cpp/pull/7412#issuecomment-2120427347
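In terms of the params discussed above, that mixed setup is just the two ints differing, e.g.:

```python
# Mixed-precision KV cache: Q8_0 for the more sensitive K, Q4_0 for V.
params["type_k"] = 8  # GGML_TYPE_Q8_0
params["type_v"] = 2  # GGML_TYPE_Q4_0
```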
Speaking of which, folks, it would be nice if all of these options automatically selected (if not outright greyed out) flash attention. Otherwise this is thrown:
llama_new_context_with_model: V cache quantization requires flash_attn
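A minimal sketch of a guard that could do that on the webui side, assuming shared.args exposes flash_attn plus the cache flags and that a logger is available:

```python
# If a quantized KV cache is requested, enable flash attention automatically
# instead of letting llama.cpp abort with
# "llama_new_context_with_model: V cache quantization requires flash_attn".
if (shared.args.cache_8bit or shared.args.cache_4bit) and not shared.args.flash_attn:
    logger.warning("Quantized KV cache requires flash_attn; enabling it automatically.")
    shared.args.flash_attn = True
```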
Description
For some time now there has been an option to use a Q8 or Q4 KV cache in llama.cpp. It is exposed, for example, in KoboldCPP and works great there. Using a quantized KV cache reduces the VRAM required to run GGUF models. Similar options already exist for exllamav2 and work great.
Perhaps it would be possible to add "cache_8bit" and "cache_4bit" options to the llamacpp loaders as well, so users can use a quantized KV cache like in KoboldCPP?
Additional Context
Below are my notes from trying to add this. I'm not really familiar with the webui codebase, but I tried adding the options in a quick way; I only managed to get the Q8 cache to work.
From what I can see, the Llama class in llamacpp has these two parameters: type_k: Optional[int] = None and type_v: Optional[int] = None.
Setting these to a value like 8 seems to just work and cuts the VRAM used by the KV cache in half. I managed to use this in the webui by setting them inside the from_pretrained method in modules/llamacpp_model.py: params["type_k"] = 8 and params["type_v"] = 8. For this to work, the flash_attn option is required or else it throws errors, so perhaps a warning for users should be added too. I briefly tested it with a few GGUF models; all worked without any problems.
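For reference, the same Q8 setup can be reproduced standalone with llama-cpp-python (the model path and prompt are placeholders):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder
    n_gpu_layers=-1,                   # offload all layers; adjust as needed
    flash_attn=True,                   # required for a quantized KV cache
    type_k=8,                          # GGML_TYPE_Q8_0
    type_v=8,                          # GGML_TYPE_Q8_0
)
out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```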
However, for the Q4 cache (params["type_k"]/params["type_v"] = 4) it didn't work, so most likely some more work is needed here; I don't know enough to figure out why it fails. I'm also not sure about the llamacpp_HF loader, as I have never used it.