Performance improvements for legacy quants on ARM_NEON

Processing 2 rows of the left matrix at a time gives a performance boost of roughly 15-23% for PP-512 (except for Q8_0, which gains only about 3% because its performance was already higher than that of the other quants). The table summarizes the results for LLaMA-3.1-8B on an M2-Max CPU. As I like keeping track of how we perform relative to mainline llama.cpp, the table includes results for the current llama.cpp build (69a480a (3660)). tinyBLAS is enabled in llama.cpp, so the 33% (Q4_0) or 16.6% (Q8_0) improvement is relative to tinyBLAS, which does not provide an implementation for Q4_1, Q5_0 and Q5_1 (and correspondingly the performance gap there is much larger).

Quants	t/s (llama.cpp)	t/s (main)	t/s (PR)	Speedup vs main	Speedup vs llama.cpp
Q4_0	65.45 ± 0.01	72.88 ± 0.61	87.22 ± 0.85	1.197	1.333
Q4_1	35.18 ± 0.51	59.95 ± 1.26	73.87 ± 0.47	1.232	2.100
Q5_0	26.69 ± 0.35	62.63 ± 1.47	74.32 ± 0.13	1.187	2.785
Q5_1	23.33 ± 0.06	52.83 ± 1.32	60.79 ± 0.19	1.151	2.606
Q8_0	75.44 ± 1.84	85.08 ± 1.74	88.01 ± 0.11	1.034	1.166
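
To illustrate why handling 2 rows of the left matrix at a time can help, here is a minimal scalar sketch. It is not the NEON kernel from this PR: the function names, the plain `int8_t` layout and the `int32_t` accumulation are assumptions made for illustration only, and no quantization blocks or scales are modeled. The point it shows is that each value loaded from the right matrix is used for two left-matrix rows instead of one, so the right-hand side is read half as often.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference: C = A * B^T, one left-matrix row at a time.
// A is M x K, B is N x K, C is M x N, all row-major.
void matmul_one_row(const int8_t* A, const int8_t* B, int32_t* C,
                    size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            int32_t acc = 0;
            for (size_t k = 0; k < K; ++k) {
                acc += int32_t(A[i * K + k]) * int32_t(B[j * K + k]);
            }
            C[i * N + j] = acc;
        }
    }
}

// Hypothetical variant: two left-matrix rows per pass.
// Each B value is loaded once and reused for both rows.
void matmul_two_rows(const int8_t* A, const int8_t* B, int32_t* C,
                     size_t M, size_t N, size_t K) {
    size_t i = 0;
    for (; i + 1 < M; i += 2) {
        const int8_t* a0 = A + i * K;        // first row of the pair
        const int8_t* a1 = a0 + K;           // second row of the pair
        for (size_t j = 0; j < N; ++j) {
            const int8_t* b = B + j * K;
            int32_t acc0 = 0, acc1 = 0;      // two independent accumulators
            for (size_t k = 0; k < K; ++k) {
                const int32_t bv = b[k];     // single load, used twice
                acc0 += int32_t(a0[k]) * bv;
                acc1 += int32_t(a1[k]) * bv;
            }
            C[i * N + j]       = acc0;
            C[(i + 1) * N + j] = acc1;
        }
    }
    // Leftover row when M is odd: fall back to the one-row kernel.
    if (i < M) {
        matmul_one_row(A + i * K, B, C + i * N, 1, N, K);
    }
}
```

Both functions should produce identical results for any `M`, including odd values, so the one-row version can serve as a correctness check for the two-row version. In a real SIMD kernel the same pattern would apply to vector registers rather than scalars, and the achievable gain depends on how many registers remain free for the extra accumulators.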

ikawrakow / ik_llama.cpp

Performance improvements for legacy quants on ARM_NEON #37