Is this actually not supported yet?
Either way, probably should keep this in mind: https://github.com/ggerganov/llama.cpp/issues/1919
The solution there was to add checks in the quantization tool to prevent trying to quantize those tensors (or models) with k-quants, but a project using GGML at a lower level would need to verify that the rows and columns are divisible by QK_K.
Is someone already on this?
I'm a Rust noob and would probably have to chicken out when it gets complicated, but I could at least start with it 😃
I would start by taking all the #ifdef GGML_USE_K_QUANTS blocks from llama.cpp, having GPT-4 convert them to Rust, and trying to fit them into the llm codebase.
I think the first thing you'd need to do is check whether the llama.cpp binding llm depends on is compiled with k-quants. If it is, you probably don't have to do anything more than add the k-quants types to the enums where quantization types are currently listed. In the case of the quantization tool (and maybe also when loading models) you'd have to check that the tensor rows/columns are divisible by QK_K (you could probably just hardcode 256 there; see the sketch below).
As long as the binding part is against a new enough version of GGML and it's getting compiled with k-quants, this might be a simple change.
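For illustration, a minimal sketch of what such a check could look like on the llm side (QK_K follows llama.cpp's value of 256; the helper name and how the dimensions are obtained are assumptions, not the actual llm API):

```rust
/// Size of a k-quant super-block; llama.cpp defines QK_K as 256.
const QK_K: usize = 256;

/// Hypothetical helper: returns true if a tensor with the given
/// dimensions can be quantized with one of the k-quant formats.
fn is_k_quantizable(rows: usize, cols: usize) -> bool {
    rows % QK_K == 0 && cols % QK_K == 0
}

fn main() {
    // A 4096x4096 weight matrix passes, a 4096x4001 one does not.
    assert!(is_k_quantizable(4096, 4096));
    assert!(!is_k_quantizable(4096, 4001));
}
```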
@KerfuffleV2 thanks for your input. I might be mistaken/misunderstanding, but I think llm only depends on ggml.c and reimplements the file loading from llama.cpp. At least I'm finding places where the quantization types are listed as enums or constants.
@LLukas22 @philpax could you enlighten us?
I might be mistaken/misunderstanding, but I think llm only depends on ggml.c
Sorry, I might have been incorrect there. I thought I saw a pull a while back saying something like it was now depending on llama.cpp.
You could possibly look at the approach I use here: https://github.com/KerfuffleV2/ggml-sys-bleedingedge - well, actually two approaches. One is just to build the old-style way and include the k-quants stuff unless it's disabled by a feature. The other way is to use cmake to build, which makes stuff like compiling with CUDA simpler (although it doesn't help with determining how to link). For the latter approach, I was able to get a ggml_static target added, so you actually don't have to link against llama.cpp when you only need ggml (but it will require linking with C++ because of BLAS stuff).
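To make the first approach concrete, here is a rough build.rs sketch using the cc crate, gating k-quants behind a Cargo feature. The GGML_USE_K_QUANTS define and the k_quants.c file name follow llama.cpp, but the feature name and directory layout are assumptions about the binding crate:

```rust
// build.rs -- hypothetical sketch, not the actual ggml-sys build script.
fn main() {
    let mut build = cc::Build::new();
    build.file("ggml/ggml.c").include("ggml");

    // Cargo exposes an enabled `k-quants` feature through this env var.
    // Only then do we pull in the k-quants sources and set the same
    // preprocessor flag llama.cpp uses when building with k-quants.
    if std::env::var("CARGO_FEATURE_K_QUANTS").is_ok() {
        build.file("ggml/k_quants.c");
        build.define("GGML_USE_K_QUANTS", None);
    }

    build.compile("ggml");
}
```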
@nightscape If we want to support the k-quants we probably have to wrap k_quants.h in a similar way to ggml.h (see here), and we probably have to compile ggml with k-quants enabled in the build.rs file. Once that's implemented we have to extend our enums and extend the quantization logic to call the correct wrapped functions.
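As a rough illustration of what that wrapping could look like (the extern signature is my reading of ggml_quantize_q4_K in k_quants.h and should be checked against the vendored header; the safe wrapper and how the quantization logic would dispatch to it are assumptions, not the actual llm API):

```rust
use std::os::raw::{c_int, c_void};

extern "C" {
    // One of the k_quants.h entry points; copy the exact signatures
    // from the vendored header rather than from this sketch.
    fn ggml_quantize_q4_K(
        src: *const f32,
        dst: *mut c_void,
        n: c_int,
        k: c_int,
        hist: *mut i64,
    ) -> usize;
}

/// Hypothetical safe wrapper the quantization logic could dispatch to
/// when a new Q4_K variant of the quantization-type enum is selected.
/// Returns the number of bytes written into `dst`.
pub fn quantize_q4_k(src: &[f32], dst: &mut [u8], k: usize, hist: &mut [i64]) -> usize {
    unsafe {
        ggml_quantize_q4_K(
            src.as_ptr(),
            dst.as_mut_ptr() as *mut c_void,
            src.len() as c_int,
            k as c_int,
            hist.as_mut_ptr(),
        )
    }
}
```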
llama.cpp now supports new k-quants quantizations, which achieve good model perplexity even at high quantization levels. See https://github.com/ggerganov/llama.cpp/pull/1684. We should also support these new quantization formats.