ikawrakow / ik_llama.cpp

llama.cpp fork with additional SOTA quants and improved performance
MIT License

AVX2 Flash Attention #48

Closed ikawrakow closed 2 months ago

ikawrakow commented 2 months ago

We don't gain as much as on a Zen4 system because AVX2 has fewer vector registers (16 vs. 32 with AVX-512), so we need to load/store data much more often. Still, we do get a small gain in performance.

For now it supports only an fp16 kv-cache. Support for a quantized kv-cache will be added later.