ikawrakow / ik_llama.cpp
llama.cpp fork with additional SOTA quants and improved performance
MIT License
iq2_k: slightly better bpw - accuracy compromise #20
Closed · ikawrakow closed this 2 months ago
ikawrakow commented 2 months ago
For LLaMA-3.1 models:
- It is better to quantize all of attn_v with iq3_k than to quantize half of attn_v with iq4_k.
- Quantizing attn_output with iq3_k results in a larger PPL decrease than the added bpw would suggest.

Both choices amount to per-tensor overrides in the quantization mix; a sketch follows below.
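For illustration, here is a minimal sketch of how such a per-tensor override could look in a `llama_tensor_get_type`-style mix function. This is an assumption, not the actual ik_llama.cpp code: the helper name `pick_iq2_k_mix_type`, the standalone enum, and the substring matching are illustrative; only the tensor names (`attn_v.weight`, `attn_output.weight`) and the quant choices follow the observations above.

```cpp
// Illustrative sketch only -- not the actual ik_llama.cpp implementation.
#include <cstdio>
#include <cstring>

// Stand-in for the real ggml_type enum; only the types used here.
enum ggml_type { GGML_TYPE_IQ2_K, GGML_TYPE_IQ3_K, GGML_TYPE_IQ4_K };

// Pick the quant type for one tensor of a hypothetical IQ2_K mix.
static ggml_type pick_iq2_k_mix_type(const char * name) {
    if (strstr(name, "attn_v.weight")) {
        // All of attn_v at iq3_k beats half of attn_v at iq4_k (per the issue).
        return GGML_TYPE_IQ3_K;
    }
    if (strstr(name, "attn_output.weight")) {
        // iq3_k here buys more PPL than the extra bpw would suggest.
        return GGML_TYPE_IQ3_K;
    }
    return GGML_TYPE_IQ2_K; // everything else stays at the base type
}

int main() {
    // Tensor names follow the GGUF "blk.<layer>.<name>" convention.
    const char * names[] = {
        "blk.0.attn_v.weight", "blk.0.attn_output.weight", "blk.0.ffn_down.weight",
    };
    const char * labels[] = { "iq2_k", "iq3_k", "iq4_k" };
    for (const char * n : names) {
        printf("%-26s -> %s\n", n, labels[pick_iq2_k_mix_type(n)]);
    }
}
```

Run as-is, the sketch routes both attention tensors to iq3_k and leaves everything else at the iq2_k base type, which is the trade-off the two bullets describe.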