Is this actually not supported yet?
Either way, probably should keep this in mind: https://github.com/ggerganov/llama.cpp/issues/1919
The solution there was to add checks in the quantization tool to prevent trying to quantize those tensors (or models) with k-quants, but a project using GGML at a lower level would need to verify that the rows and columns are divisible by QK_K.
Is someone already on this?
I'm a Rust noob and would probably have to chicken out when it gets complicated, but I could at least start with it 😃
I would start by taking all the #ifdef GGML_USE_K_QUANTS blocks from llama.cpp, having GPT-4 convert them to Rust, and trying to fit them into the llm codebase.
I think the first thing you'd need to do is check whether the llama.cpp binding llm depends on is compiled with k-quants. If it is, you probably don't have to do anything more than add the k-quants types to the enums where quantization types are currently listed. In the case of the quantization tool (and maybe also when loading models) you'd have to check that the tensor rows/columns are divisible by QK_K (you could probably just hardcode 256 there; see the sketch below).
As long as the binding part is against a new enough version of GGML and it's getting compiled with k-quants, this might be a simple change.
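For illustration, a minimal sketch of what such a check could look like on the llm side (QK_K follows llama.cpp's value of 256; the helper name and how the dimensions are obtained are assumptions, not the actual llm API):

```rust
/// Size of a k-quant super-block; llama.cpp defines QK_K as 256.
const QK_K: usize = 256;

/// Hypothetical helper: returns true if a tensor with the given
/// dimensions can be quantized with one of the k-quant formats.
fn is_k_quantizable(rows: usize, cols: usize) -> bool {
    rows % QK_K == 0 && cols % QK_K == 0
}

fn main() {
    // A 4096x4096 weight matrix passes, a 4096x4001 one does not.
    assert!(is_k_quantizable(4096, 4096));
    assert!(!is_k_quantizable(4096, 4001));
}
```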
@KerfuffleV2 thanks for your input. I might be mistaken/misunderstanding, but I think llm only depends on ggml.c and reimplements the file loading from llama.cpp. At least I'm finding places where the quantization types are listed as enums or constants.
@LLukas22 @philpax could you enlighten us?
I might be mistaken/misunderstanding, but I think llm only depends on ggml.c
Sorry, I might have been incorrect there. I thought I saw a pull a while back saying something like it was now depending on llama.cpp.
You could possibly look at the approach I use here: https://github.com/KerfuffleV2/ggml-sys-bleedingedge - well, actually two approaches. One is just to build the old-style way and include the k-quants stuff unless it's disabled by a feature. The other way is to use cmake to build, which makes stuff like compiling with CUDA simpler (although it doesn't help with determining how to link). For the latter approach, I was able to get a ggml_static target added, so you actually don't have to link against llama.cpp when you only need ggml (but it will require linking with C++ because of BLAS stuff).
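To make the first approach concrete, here is a rough build.rs sketch using the cc crate, gating k-quants behind a Cargo feature. The GGML_USE_K_QUANTS define and the k_quants.c file name follow llama.cpp, but the feature name and directory layout are assumptions about the binding crate:

```rust
// build.rs -- hypothetical sketch, not the actual ggml-sys build script.
fn main() {
    let mut build = cc::Build::new();
    build.file("ggml/ggml.c").include("ggml");

    // Cargo exposes an enabled `k-quants` feature through this env var.
    // Only then do we pull in the k-quants sources and set the same
    // preprocessor flag llama.cpp uses when building with k-quants.
    if std::env::var("CARGO_FEATURE_K_QUANTS").is_ok() {
        build.file("ggml/k_quants.c");
        build.define("GGML_USE_K_QUANTS", None);
    }

    build.compile("ggml");
}
```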
@nightscape If we want to support the k-quants we probably have to wrap k_quants.h in a similar way to ggml.h (see here), and we probably have to compile ggml with k-quants enabled in the build.rs file. Once that's implemented we have to extend our enums and extend the quantization logic to call the correct wrapped functions.
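As a rough illustration of what that wrapping could look like (the extern signature is my reading of ggml_quantize_q4_K in k_quants.h and should be checked against the vendored header; the safe wrapper and how the quantization logic would dispatch to it are assumptions, not the actual llm API):

```rust
use std::os::raw::{c_int, c_void};

extern "C" {
    // One of the k_quants.h entry points; copy the exact signatures
    // from the vendored header rather than from this sketch.
    fn ggml_quantize_q4_K(
        src: *const f32,
        dst: *mut c_void,
        n: c_int,
        k: c_int,
        hist: *mut i64,
    ) -> usize;
}

/// Hypothetical safe wrapper the quantization logic could dispatch to
/// when a new Q4_K variant of the quantization-type enum is selected.
/// Returns the number of bytes written into `dst`.
pub fn quantize_q4_k(src: &[f32], dst: &mut [u8], k: usize, hist: &mut [i64]) -> usize {
    unsafe {
        ggml_quantize_q4_K(
            src.as_ptr(),
            dst.as_mut_ptr() as *mut c_void,
            src.len() as c_int,
            k as c_int,
            hist.as_mut_ptr(),
        )
    }
}
```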
llama.cpp now supports new k-quants quantizations, which achieve good model perplexity even at high quantization levels. See https://github.com/ggerganov/llama.cpp/pull/1684. We should also support these new quantization formats.