ikawrakow / ik_llama.cpp

llama.cpp clone with additional SOTA quants and improved CPU performance

Quantized Flash Attention for all supported CPU platforms #51

Closed by ikawrakow 1 week ago

ikawrakow commented 1 week ago

This PR adds two features:

* Flash attention (FA) support for all supported CPU platforms
* Support for a quantized KV cache in the CPU flash attention implementation
The second bullet leads to performance improvements that grow with context length. The following graph shows an example: prompt processing speed for Q4_K_S-quantized LLaMA-3.1-8B as a function of prompt length, measured on a Ryzen-7950X CPU (Zen4). The orange curve is the new implementation in this PR with the KV cache quantized with Q8_0. At 32k tokens we now get 91.4 t/s vs 64.4 t/s without FA, a 42% improvement in the quest to improve CPU performance for large contexts. I did not have the patience to wait for mainline llama.cpp to finish processing 32k tokens, but at 8k tokens, the longest context where my patience held out, we are now 2.2X faster than mainline without FA and 3X faster than mainline with FA.

[Graph fa_q: prompt processing speed (t/s) vs. prompt length for Q4_K_S-quantized LLaMA-3.1-8B]
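
For readers who want to try this, here is a minimal sketch of how these options can be enabled through the llama.cpp C API. The field names (`flash_attn`, `type_k`, `type_v`) are those of mainline llama.cpp; I am assuming, not confirming, that this fork keeps them unchanged:

```cpp
// Sketch: enable flash attention with a Q8_0-quantized KV cache via the
// llama.cpp C API. Field names follow mainline llama.cpp and are assumed
// to carry over to this fork.
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (!model) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 32768;          // long context, where FA pays off most
    cparams.flash_attn = true;           // enable the flash attention path
    cparams.type_k     = GGML_TYPE_Q8_0; // quantize the K cache with Q8_0
    cparams.type_v     = GGML_TYPE_Q8_0; // quantize the V cache with Q8_0

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (!ctx) {
        fprintf(stderr, "failed to create context\n");
        return 1;
    }

    // ... run prompt processing / generation as usual ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

The common command-line tools in mainline llama.cpp expose the same knobs as `-fa`, `-ctk q8_0` and `-ctv q8_0`; if the fork keeps mainline's argument parsing, those flags should work here as well.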