The main purpose of the previous PR was to try to improve the `K*Q` matrix multiplications for flash attention with a `Q8_0`-quantized K-cache. Sadly, the performance improvement we got for `Q8_0` did not translate into better FA performance. It is a rainy Saturday, so I need something to brighten my day. The last PR is very easily applied to `Q5_0`, so here we are.
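For readers not familiar with the format: `Q5_0` packs 32 values per block with a single shared scale, storing 4 low bits per value plus a separately packed 5th (high) bit. The sketch below is a simplified, self-contained rendition of how one such block dequantizes (the real ggml block stores the scale as fp16; names here are illustrative), just to show the data layout the `K*Q` kernels have to unpack.

```c
#include <stdint.h>
#include <string.h>

// Illustrative Q5_0 block: 32 values sharing one scale, stored as
// 4 low bits per value plus a packed 5th (high) bit.
// The real ggml struct uses an fp16 scale; a plain float keeps this
// sketch self-contained.
#define QK5_0 32

typedef struct {
    float   d;              // per-block scale (fp16 in ggml)
    uint8_t qh[4];          // high (5th) bit of each of the 32 quants
    uint8_t qs[QK5_0 / 2];  // low 4 bits, two quants per byte
} block_q5_0_sketch;

// Dequantize one block into 32 floats: q = ((low 4 bits) | (high bit)) - 16,
// so the quantized values span [-16, 15] and are scaled by d.
static void dequantize_block_q5_0(const block_q5_0_sketch *x, float *y) {
    uint32_t qh;
    memcpy(&qh, x->qh, sizeof(qh));
    for (int j = 0; j < QK5_0 / 2; ++j) {
        const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;  // high bit of value j
        const uint8_t xh_1 = ((qh >> (j + 12))     ) & 0x10;  // high bit of value j + 16
        const int32_t x0 = ((x->qs[j] & 0x0F) | xh_0) - 16;
        const int32_t x1 = ((x->qs[j] >>   4) | xh_1) - 16;
        y[j            ] = x0 * x->d;
        y[j + QK5_0 / 2] = x1 * x->d;
    }
}
```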
The table below shows a performance comparison with mainline `llama.cpp` for LLaMA-3.1-8B on a Ryzen-7950X.