ikawrakow / ik_llama.cpp

llama.cpp clone with additional SOTA quants and improved CPU performance

Faster Gemma2 #27

Closed · ikawrakow closed this 3 weeks ago

ikawrakow commented 3 weeks ago

In a previous PR I had fused the scale - tanh - scale sequence used for "soft-capping" activations into a GGML_OP_SOFTCAP operation. This PR further fuses GGML_OP_SOFTCAP with GGML_OP_SOFT_MAX into a new GGML_OP_SOFT_CAP_MAX operation. This is useful, e.g., for the self-attention in the Gemma-2 series of models, and leads to a significant performance increase.
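
For reference, what the fused op computes per row of the KQ matrix is simply the soft-cap followed by a softmax, done in one pass over the data instead of materializing the capped scores as a separate tensor between two ops. Below is a minimal standalone C++ sketch of that per-row computation; it is illustrative only, not the actual ggml kernel, and the function name and the way the two scales are passed are my assumptions:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One row of attention scores: soft-capping (scale -> tanh -> scale)
// followed by softmax. Illustrative sketch only -- not the ggml kernel;
// the exact parameterization of the two scales is an assumption.
static void soft_cap_max_row(std::vector<float> & row, float s_pre, float s_post) {
    float max_val = -INFINITY;
    for (float & x : row) {
        x = s_post * std::tanh(s_pre * x);  // soft-cap: y = s_post * tanh(s_pre * x)
        max_val = std::max(max_val, x);
    }
    float sum = 0.0f;
    for (float & x : row) {
        x = std::exp(x - max_val);          // softmax numerator
        sum += x;
    }
    for (float & x : row) {
        x /= sum;                           // normalize
    }
}
```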

In addition, "soft-capping" is added to flash attention. I see this has also been done in mainline llama.cpp in PR-8542 and PR-9159.
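
With flash attention the same capping simply moves inside the kernel: each KQ score is capped right after the Q·K dot product, before it enters the online softmax. A greatly simplified scalar sketch of where that step sits (illustrative, not the actual CUDA/Metal kernels; names and structure are mine):

```cpp
#include <algorithm>
#include <cmath>

// Simplified scalar flash-attention inner loop for a single query row,
// with soft-capping applied to each score before the online softmax.
static void fa_row_with_softcap(const float * q, const float * k, const float * v,
                                int n_kv, int head_dim, float kq_scale, float cap,
                                float * out) {
    float running_max = -INFINITY;
    float running_sum = 0.0f;
    for (int d = 0; d < head_dim; ++d) out[d] = 0.0f;

    for (int j = 0; j < n_kv; ++j) {
        float s = 0.0f;
        for (int d = 0; d < head_dim; ++d) s += q[d]*k[j*head_dim + d];
        s *= kq_scale;
        if (cap > 0.0f) s = cap * std::tanh(s/cap);  // soft-capping, as in Gemma-2

        // online softmax update: rescale previous accumulators to the new max
        const float new_max = std::max(running_max, s);
        const float corr    = std::exp(running_max - new_max);
        const float p       = std::exp(s - new_max);

        running_sum = running_sum*corr + p;
        for (int d = 0; d < head_dim; ++d) out[d] = out[d]*corr + p*v[j*head_dim + d];
        running_max = new_max;
    }
    for (int d = 0; d < head_dim; ++d) out[d] /= running_sum;
}
```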

Here are some performance comparisons to llama.cpp (build 3631) for Gemma-2-2b on CUDA (RTX-4080), Metal (30-core M2-Max GPU), AVX2 (Ryzen-7950X) and ARM_NEON (M2-Max CPU). The model is quantized with Q4_K_S (the performance gap between this repo and mainline llama.cpp is smaller for this quantization type than for most other quants).

No Flash attention

| backend | ngl | threads | test | t/s (llama.cpp) | t/s (PR) | Speedup |
|----------|----:|--------:|---------|------------------:|------------------:|--------:|
| CUDA | 100 | 1 | tg128 | 239.20 ± 0.27 | 244.47 ± 0.42 | 1.022 |
| CUDA | 100 | 1 | pp512 | 18413.90 ± 566 | 18824.91 ± 480 | 1.022 |
| CUDA | 100 | 1 | pp2048 | 17827.18 ± 106 | 18307.66 ± 77 | 1.027 |
| CUDA | 100 | 1 | pp8192 | 8814.67 ± 7.27 | 11673.96 ± 8.07 | 1.324 |
| CUDA | 100 | 1 | pp32768 | 2827.13 ± 12.12 | 4634.12 ± 4.84 | 1.639 |
| AVX2 | 0 | 4 | tg128 | 32.68 ± 0.08 | 35.26 ± 0.05 | 1.079 |
| AVX2 | 0 | 16 | pp512 | 278.34 ± 1.04 | 620.40 ± 3.24 | 2.229 |
| AVX2 | 0 | 16 | pp2048 | 217.57 ± 0.70 | 562.58 ± 2.31 | 2.586 |
| AVX2 | 0 | 16 | pp8192 | 111.29 ± 0.15 | 414.44 ± 0.83 | 3.724 |
| AVX2 | 0 | 16 | pp32768 | 35.78 ± 0.00 | 199.58 ± 0.00 | 5.578 |
| Metal | 100 | 8 | tg128 | 88.82 ± 0.19 | 91.06 ± 0.18 | 1.025 |
| Metal | 100 | 8 | pp512 | 1427.74 ± 1.44 | 1512.66 ± 0.59 | 1.059 |
| Metal | 100 | 8 | pp2048 | 1363.51 ± 0.62 | 1456.12 ± 0.73 | 1.068 |
| Metal | 100 | 8 | pp8192 | 1093.02 ± 0.86 | 1224.56 ± 0.52 | 1.120 |
| Metal | 100 | 8 | pp32768 | 572.65 ± 1.13 | 728.75 ± 5.56 | 1.272 |
| ARM_NEON | 0 | 8 | tg128 | 54.06 ± 0.15 | 62.49 ± 0.18 | 1.156 |
| ARM_NEON | 0 | 8 | pp512 | 148.92 ± 0.15 | 243.09 ± 0.06 | 1.632 |
| ARM_NEON | 0 | 8 | pp2048 | 130.66 ± 1.84 | 226.46 ± 5.41 | 1.733 |
| ARM_NEON | 0 | 8 | pp8192 | 97.95 ± 3.57 | 189.65 ± 4.30 | 1.936 |

For very large prompts (pp32768) the performance difference is striking, reaching 5.5X for AVX2!

Flash attention

Flash attention is only useful on CUDA (on the other three platforms available to me, performance is lower with flash attention enabled), so only CUDA results are shown here:

| backend | ngl | threads | fa | test | t/s (llama.cpp) | t/s (PR) | Speedup |
|---------|----:|--------:|---:|---------|------------------:|------------------:|--------:|
| CUDA | 100 | 1 | 1 | tg128 | 251.86 ± 0.56 | 256.15 ± 0.76 | 1.017 |
| CUDA | 100 | 1 | 1 | pp512 | 19127.14 ± 529.58 | 19712.11 ± 167.06 | 1.031 |
| CUDA | 100 | 1 | 1 | pp2048 | 18641.99 ± 72.13 | 19823.18 ± 91.26 | 1.063 |
| CUDA | 100 | 1 | 1 | pp8192 | 13566.85 ± 111.75 | 16108.68 ± 30.32 | 1.187 |
| CUDA | 100 | 1 | 1 | pp32768 | 6472.16 ± 4.43 | 9053.46 ± 9.68 | 1.399 |

40% faster for 32k tokens is quite nice.