ikawrakow / ik_llama.cpp

llama.cpp fork with additional SOTA quants and improved performance
MIT License

Improve Q4_0 and Q8_0 performance on AVX2/Zen4 #54

Closed by ikawrakow 1 month ago

ikawrakow commented 1 month ago

This PR improves Q4_0 and Q8_0 performance on AVX2 and Zen4. The table below compares against mainline llama.cpp for LLaMA-3.1-8B on a Ryzen-7950X (Zen4) and a Ryzen-5975WX (AVX2) CPU.

| model | backend | threads | test | t/s (llama.cpp) | t/s (PR) | Speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| llama 8B Q4_0 | Zen4 | 16 | pp512 | 123.46 ± 0.09 | 165.26 ± 0.54 | 1.339 |
| llama 8B Q8_0 | Zen4 | 16 | pp512 | 141.30 ± 0.86 | 169.26 ± 0.57 | 1.200 |
| llama 8B Q4_0 | Zen4 | 4 | tg128 | 11.25 ± 0.02 | 13.88 ± 0.01 | 1.234 |
| llama 8B Q8_0 | Zen4 | 4 | tg128 | 7.56 ± 0.01 | 7.79 ± 0.02 | 1.030 |
| llama 8B Q4_0 | AVX2 | 32 | pp512 | 139.09 ± 0.62 | 212.70 ± 0.82 | 1.529 |
| llama 8B Q8_0 | AVX2 | 32 | pp512 | 162.21 ± 0.42 | 217.14 ± 0.65 | 1.339 |
| llama 8B Q4_0 | AVX2 | 8 | tg128 | 11.90 ± 0.00 | 11.99 ± 0.00 | 1.008 |
| llama 8B Q8_0 | AVX2 | 8 | tg128 | 8.13 ± 0.00 | 8.21 ± 0.00 | 1.010 |
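For readers unfamiliar with the formats being benchmarked: Q8_0 stores weights in blocks of 32 values, each block holding one scale and 32 signed 8-bit quants (Q4_0 is analogous with 4-bit quants). The sketch below is an illustrative, scalar Python version of that block layout, not the fork's optimized AVX2/Zen4 kernels; the function names are our own, not identifiers from ik_llama.cpp.

```python
def quantize_q8_0_block(xs):
    """Quantize a block of 32 floats to (scale, list of int8 quants).

    Scalar sketch of the Q8_0 layout: one scale per 32-value block,
    chosen so the largest-magnitude value maps to +/-127.
    """
    assert len(xs) == 32
    amax = max(abs(x) for x in xs)           # largest magnitude in the block
    d = amax / 127.0 if amax > 0 else 0.0    # per-block scale
    inv_d = 1.0 / d if d > 0 else 0.0
    qs = [max(-127, min(127, round(x * inv_d))) for x in xs]
    return d, qs

def dequantize_q8_0_block(d, qs):
    """Reconstruct approximate floats from a quantized block."""
    return [d * q for q in qs]

# Round-trip a sample block; the reconstruction error is bounded by
# about half the scale d.
block = [i / 10.0 for i in range(-16, 16)]   # 32 sample values
d, qs = quantize_q8_0_block(block)
approx = dequantize_q8_0_block(d, qs)
max_err = max(abs(a - b) for a, b in zip(block, approx))
```

The real kernels operate on many such blocks at once with SIMD intrinsics; the speedups in the table come from how those per-block dot products are vectorized, not from any change to the storage format itself.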