Yes, but their GPTQ uses a group size of 128: https://huggingface.co/Qwen/Qwen1.5-4B-Chat-GPTQ-Int4/blob/ff03f8a9647d68587c4bc621eeafd61c9df4487b/config.json#L29
The understanding is that a group size of 32 would be better.
From the source: https://github.com/ggerganov/llama.cpp/pull/1684
In the existing ggml quantization types we have "type-0" (Q4_0, Q5_0) and "type-1" (Q4_1, Q5_1). In "type-0", weights w are obtained from quants q using w = d * q, where d is the block scale. In "type-1", weights are given by w = d * q + m, where m is the block minimum.
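For illustration, here is a minimal sketch of the two dequantization rules (toy values, not actual llama.cpp code; the function names are made up):

```python
import numpy as np

def dequant_type0(q, d):
    # "type-0" (Q4_0, Q5_0): w = d * q, one scale d per block
    return d * q.astype(np.float32)

def dequant_type1(q, d, m):
    # "type-1" (Q4_1, Q5_1): w = d * q + m, scale d plus block minimum m
    return d * q.astype(np.float32) + m

# toy block of 32 unsigned 4-bit quants (0..15)
q = np.random.randint(0, 16, size=32, dtype=np.uint8)
w0 = dequant_type0(q, d=0.01)
w1 = dequant_type1(q, d=0.01, m=-0.08)
```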
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
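The bits-per-weight numbers follow directly from that layout. A back-of-the-envelope check (assuming the fp16 super-block scale/min that llama.cpp stores alongside the quants, and 256 weights per super-block):

```python
def bpw(bits_per_weight, n_blocks, per_block_meta_bits, superblock_meta_bits, weights=256):
    # total bits in one super-block divided by the number of weights it holds
    total = weights * bits_per_weight + n_blocks * per_block_meta_bits + superblock_meta_bits
    return total / weights

print(bpw(4, 8, 6 + 6, 16 + 16))  # Q4_K: 6-bit scale + 6-bit min per block, fp16 d/dmin -> 4.5
print(bpw(5, 8, 6 + 6, 16 + 16))  # Q5_K: same structure, 5-bit quants -> 5.5
print(bpw(6, 16, 8, 16))          # Q6_K: 8-bit scale per block, fp16 d -> 6.5625
```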
There are several ways to improve the quality of a GPTQ model, including using w4g64 instead of w4g128, or doing QAT such as EfficientQAT. Another factor is prompt quality: without the correct prompt, some models can produce random output.
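For reference, a sketch of the w4g64 suggestion using the transformers GPTQ integration (the model id, dataset, and output path below are placeholders, not something we ship):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "Qwen/Qwen1.5-4B-Chat"  # placeholder: the FP16 base model to quantize
tokenizer = AutoTokenizer.from_pretrained(model_id)

# group_size=64 (w4g64) gives finer-grained scales than the default 128 (w4g128),
# trading a little extra model size for better accuracy.
quant_config = GPTQConfig(bits=4, group_size=64, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
model.save_pretrained("Qwen1.5-4B-Chat-GPTQ-w4g64")
```

On the prompt side, applying the model's chat template (e.g. `tokenizer.apply_chat_template(...)`) rules out the "random output" failure mode caused by a wrong prompt format.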
In our experience, Qwen2 GPTQ w4g128 already performs well enough. However, we are still working on merging the latest llama.cpp to support Qwen2. Track the progress through #46 .
@kaleid-liner @BarfingLemurs thank you very much.
Compared with llama.cpp, does T-MAC lose precision when running quantized models, or does it give the same results? I am running Qwen1.5 4-bit (https://huggingface.co/Qwen/Qwen1.5-4B-Chat-GPTQ-Int4) now, and I found that the answers given by the model are sometimes wrong, especially in English, like this: