Yes, but their GPTQ uses a group size of 128: https://huggingface.co/Qwen/Qwen1.5-4B-Chat-GPTQ-Int4/blob/ff03f8a9647d68587c4bc621eeafd61c9df4487b/config.json#L29
The understanding is that a group size of 32 would be better.
From the source: https://github.com/ggerganov/llama.cpp/pull/1684
In the existing ggml quantization types we have "type-0" (Q4_0, Q5_0) and "type-1" (Q4_1, Q5_1). In "type-0", weights w are obtained from quants q using w = d * q, where d is the block scale. In "type-1", weights are given by w = d * q + m, where m is the block minimum.
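For illustration, here is a minimal sketch of the two dequantization rules (toy values, not actual llama.cpp code; the function names are made up):

```python
import numpy as np

def dequant_type0(q, d):
    # "type-0" (Q4_0, Q5_0): w = d * q, one scale d per block
    return d * q.astype(np.float32)

def dequant_type1(q, d, m):
    # "type-1" (Q4_1, Q5_1): w = d * q + m, scale d plus block minimum m
    return d * q.astype(np.float32) + m

# toy block of 32 unsigned 4-bit quants (0..15)
q = np.random.randint(0, 16, size=32, dtype=np.uint8)
w0 = dequant_type0(q, d=0.01)
w1 = dequant_type1(q, d=0.01, m=-0.08)
```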
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
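The bits-per-weight numbers follow directly from that layout. A back-of-the-envelope check (assuming the fp16 super-block scale/min that llama.cpp stores alongside the quants, and 256 weights per super-block):

```python
def bpw(bits_per_weight, n_blocks, per_block_meta_bits, superblock_meta_bits, weights=256):
    # total bits in one super-block divided by the number of weights it holds
    total = weights * bits_per_weight + n_blocks * per_block_meta_bits + superblock_meta_bits
    return total / weights

print(bpw(4, 8, 6 + 6, 16 + 16))  # Q4_K: 6-bit scale + 6-bit min per block, fp16 d/dmin -> 4.5
print(bpw(5, 8, 6 + 6, 16 + 16))  # Q5_K: same structure, 5-bit quants -> 5.5
print(bpw(6, 16, 8, 16))          # Q6_K: 8-bit scale per block, fp16 d -> 6.5625
```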
There are several ways to improve the quality of a GPTQ model, including using w4g64 instead of w4g128, or doing QAT such as EfficientQAT. Another factor is prompt quality: without the correct prompt, some models can produce random output.
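For reference, a sketch of the w4g64 suggestion using the transformers GPTQ integration (the model id, dataset, and output path below are placeholders, not something we ship):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "Qwen/Qwen1.5-4B-Chat"  # placeholder: the FP16 base model to quantize
tokenizer = AutoTokenizer.from_pretrained(model_id)

# group_size=64 (w4g64) gives finer-grained scales than the default 128 (w4g128),
# trading a little extra model size for better accuracy.
quant_config = GPTQConfig(bits=4, group_size=64, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
model.save_pretrained("Qwen1.5-4B-Chat-GPTQ-w4g64")
```

On the prompt side, applying the model's chat template (e.g. `tokenizer.apply_chat_template(...)`) rules out the "random output" failure mode caused by a wrong prompt format.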
In our experience, Qwen2 GPTQ w4g128 already performs well enough. However, we are still working on merging the latest llama.cpp to support Qwen2. Track the progress through #46 .
@kaleid-liner @BarfingLemurs thank you very much.
Compared with llama.cpp, does T-MAC lose precision when running quantized models, or does it give the same results? I am running Qwen1.5 4-bit (https://huggingface.co/Qwen/Qwen1.5-4B-Chat-GPTQ-Int4) now, and I found that the answers given by the model are sometimes wrong, especially in English, like this: