ikawrakow / ik_llama.cpp

llama.cpp fork with additional SOTA quants and improved performance

iq4_nl: faster quantization #76

Closed · ikawrakow closed this 1 month ago

ikawrakow commented 1 month ago

Speeds up CPU flash attention using IQ4_NL.
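For context, IQ4_NL is a non-linear 4-bit format: each block of 32 values stores one fp16 scale plus 32 4-bit indices into a fixed 16-entry value table. The sketch below shows what quantizing one block involves in plain scalar code. It is a simplified illustration only (ggml's actual routine also refines the scale, and the speedup in this PR comes from a faster implementation of this step, which is not shown here):

```c++
#include <cstdint>
#include <cmath>

// The fixed non-linear value table used by IQ4_NL in ggml.
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113,
};

// Simplified block layout; ggml's actual block_iq4_nl stores d as fp16.
struct block_iq4_nl_sketch {
    float   d;        // per-block scale
    uint8_t qs[16];   // 32 x 4-bit indices: x[j] in the low nibble, x[j+16] in the high
};

static void quantize_block_iq4_nl_sketch(const float * x, block_iq4_nl_sketch * y) {
    // Choose the scale so the largest-magnitude value lands near the table ends.
    float amax = 0.0f, max = 0.0f;
    for (int j = 0; j < 32; ++j) {
        const float ax = std::fabs(x[j]);
        if (ax > amax) { amax = ax; max = x[j]; }
    }
    const float d  = max/-127.0f;
    const float id = d ? 1.0f/d : 0.0f;
    y->d = d;

    // Snap each scaled value to the nearest table entry. This brute-force
    // search is the expensive part that a fast implementation vectorizes.
    auto nearest = [id](float v) -> uint8_t {
        const float s = v*id;
        uint8_t best = 0;
        float best_err = INFINITY;
        for (uint8_t k = 0; k < 16; ++k) {
            const float err = std::fabs(s - kvalues_iq4nl[k]);
            if (err < best_err) { best_err = err; best = k; }
        }
        return best;
    };
    for (int j = 0; j < 16; ++j) {
        y->qs[j] = nearest(x[j]) | (nearest(x[j+16]) << 4);
    }
}
```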

Of note: I noticed Q8_0 cannot be used for V-cache when head size is not divisible by 128. This is because of

To fix this, one would need to

I don't like this, so I will not do it.
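In practical terms, the restriction amounts to a head-size check along these lines (an illustrative sketch only; the function name is hypothetical, not the actual code in this PR):

```c++
#include "ggml.h"

// Illustrative sketch: Q8_0 as V-cache type only works in the CPU FA path
// when the head size is a multiple of 128, so 128 and 256 are fine while
// 64, 80, 96 and 112 are not.
static bool fa_v_cache_type_ok(int head_size, ggml_type type_v) {
    if (type_v == GGML_TYPE_Q8_0) {
        return head_size % 128 == 0;
    }
    return true; // other cache types are not affected by this particular restriction
}
```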

Considering that the CUDA FA implementation does not support Q8_0 for head sizes other than 128, I think it is OK to have this limitation on Q8_0 usage for the V-cache in the CPU implementation. From my not very thorough experimentation, it seems that higher precision (or no quantization at all) for the K-cache matters much more. In the few models I tried, Q8_0 for the K-cache and IQ4_NL for the V-cache beats Q5_1 for both K- and V-cache by a significant margin while using only 8% more memory.
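To see where the ~8% comes from: a Q5_1 block stores 24 bytes per 32 elements (6.0 bits/element), a Q8_0 block 34 bytes (8.5 bits/element), and an IQ4_NL block 18 bytes (4.5 bits/element). Q8_0 K-cache plus IQ4_NL V-cache therefore costs 8.5 + 4.5 = 13 bits per K/V element pair, versus 6.0 + 6.0 = 12 bits for Q5_1 on both, and 13/12 ≈ 1.083, i.e. about 8% more memory. With the usual llama.cpp cache-type flags this combination would be selected as `-ctk q8_0 -ctv iq4_nl` (assuming the fork exposes IQ4_NL as a V-cache type on the command line, as mainline does).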