ikawrakow / ik_llama.cpp

llama.cpp fork with additional SOTA quants and improved performance
MIT License

Improve Q5_0 performance on AVX2 #55

Closed: ikawrakow closed this 1 month ago

ikawrakow commented 1 month ago

The main purpose of the previous PR was to try to improve the K*Q matrix multiplications for flash attention with a Q8_0-quantized K-cache. Sadly, the performance improvement we got for Q8_0 did not translate into better FA performance. It is a rainy Saturday, so I needed something to brighten my day. The last PR is very easily applied to Q5_0, so here we are.
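
For readers unfamiliar with the format, here is a rough sketch of what unpacking a Q5_0 block with AVX2 can look like, so the values can feed the same int8 dot-product path that Q8_0 uses. This is not the code in this PR: the struct layout, field names, and bit ordering below are assumptions modeled on the common ggml Q5_0 block format (fp16 scale, 32 packed high bits, 16 bytes of low nibbles).

```c
// Illustrative sketch only (assumed block layout, not this repository's code):
// expand one Q5_0-style block of 32 weights into signed 8-bit integers.
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint16_t d;       // fp16 scale (assumed layout)
    uint8_t  qh[4];   // 32 high bits, one per weight (assumed ordering)
    uint8_t  qs[16];  // 32 low nibbles, two weights per byte
} block_q5_0_sketch;

static inline __m256i unpack_q5_0_avx2(const block_q5_0_sketch *x) {
    // Low 4 bits: byte j holds weight j (low nibble) and weight j+16 (high nibble).
    const __m128i qs = _mm_loadu_si128((const __m128i *)x->qs);
    const __m128i m4 = _mm_set1_epi8(0x0F);
    const __m128i lo = _mm_and_si128(qs, m4);
    const __m128i hi = _mm_and_si128(_mm_srli_epi16(qs, 4), m4);
    __m256i q = _mm256_set_m128i(hi, lo);               // 32 values in 0..15

    // 5th bit: broadcast the 32-bit mask, then test one bit per byte lane.
    uint32_t qh32; memcpy(&qh32, x->qh, sizeof(qh32));
    const __m256i shuf = _mm256_setr_epi8(
        0,0,0,0, 0,0,0,0, 1,1,1,1, 1,1,1,1,
        2,2,2,2, 2,2,2,2, 3,3,3,3, 3,3,3,3);
    const __m256i bits = _mm256_setr_epi8(
        1,2,4,8, 16,32,64,-128, 1,2,4,8, 16,32,64,-128,
        1,2,4,8, 16,32,64,-128, 1,2,4,8, 16,32,64,-128);
    __m256i h = _mm256_shuffle_epi8(_mm256_set1_epi32((int)qh32), shuf);
    h = _mm256_cmpeq_epi8(_mm256_and_si256(h, bits), bits); // 0xFF where bit is set
    q = _mm256_or_si256(q, _mm256_and_si256(h, _mm256_set1_epi8(16)));

    // Center around zero: values become -16..15, ready for int8 dot products.
    return _mm256_sub_epi8(q, _mm256_set1_epi8(16));
}
```

From there the 32 signed values can be accumulated against the quantized activations with the usual AVX2 integer dot-product idiom and scaled by `d`; the point is only to show why Q5_0 can reuse machinery written for Q8_0 once the 5-bit values are expanded to int8.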

The table below compares performance against mainline llama.cpp for LLaMA-3.1-8B on a Ryzen-7950X.

| model | backend | threads | test | t/s (llama.cpp) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q5_0 | CPU | 16 | pp512 | 55.72 ± 0.25 | 152.10 ± 0.74 | 2.793 |
| llama 8B Q5_0 | CPU | 2 | tg128 | 5.22 ± 0.01 | 8.88 ± 0.01 | 1.701 |
| llama 8B Q5_0 | CPU | 4 | tg128 | 9.24 ± 0.01 | 11.57 ± 0.00 | 1.252 |