In a previous PR I fused the `scale` - `tanh` - `scale` sequence used for "soft-capping" activations into a `GGML_OP_SOFTCAP` operation. This PR further fuses `GGML_OP_SOFTCAP` with `GGML_OP_SOFT_MAX` into a new `GGML_OP_SOFT_CAP_MAX` operation. This is useful for, e.g., self-attention in the Gemma-2 series of models, and leads to a significant performance increase.
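For reference, the fused op effectively computes, per row, a softmax over soft-capped values. A minimal scalar sketch of that math follows (names and parameters are mine and purely illustrative; the real kernel is vectorized and also handles the usual softmax scaling/masking):

```c
#include <math.h>

// Illustrative reference for what soft-cap + softmax computes on one row.
// For Gemma-2 style soft-capping, s_before = 1/cap and s_after = cap,
// i.e. the capped value is cap * tanh(x / cap).
static void soft_cap_max_row(const float * x, float * y, int n,
                             float s_before, float s_after) {
    // soft-capping: scale - tanh - scale, tracking the row maximum
    float max = -INFINITY;
    for (int i = 0; i < n; ++i) {
        y[i] = s_after * tanhf(s_before * x[i]);
        if (y[i] > max) max = y[i];
    }
    // numerically stable softmax over the capped values
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        y[i] = expf(y[i] - max);
        sum += y[i];
    }
    for (int i = 0; i < n; ++i) {
        y[i] /= sum;
    }
}
```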
In addition, "soft-capping" is added to flash attention. I see this has also been done in mainline llama.cpp in PR-8542 and PR-9159.
Here are some performance comparisons to `llama.cpp` (build 3631) for Gemma-2-2b on `CUDA` (RTX-4080), `Metal` (30-core M2-Max GPU), `AVX2` (Ryzen-7950X) and `ARM_NEON` (M2-Max CPU). The model is quantized with `Q4_K_S` (the performance gap between this repo and mainline `llama.cpp` is smaller for this quantization type than for most other quants).
### No Flash attention
| backend  | ngl | threads | test    | t/s (llama.cpp)  | t/s (PR)         | Speedup |
|----------|----:|--------:|---------|-----------------:|-----------------:|--------:|
| CUDA     | 100 | 1       | tg128   | 239.20 ± 0.27    | 244.47 ± 0.42    | 1.022   |
|          | 100 | 1       | pp512   | 18413.90 ± 566   | 18824.91 ± 480   | 1.022   |
|          | 100 | 1       | pp2048  | 17827.18 ± 106   | 18307.66 ± 77    | 1.027   |
|          | 100 | 1       | pp8192  | 8814.67 ± 7.27   | 11673.96 ± 8.07  | 1.324   |
|          | 100 | 1       | pp32768 | 2827.13 ± 12.12  | 4634.12 ± 4.84   | 1.639   |
| AVX2     | 0   | 4       | tg128   | 32.68 ± 0.08     | 35.26 ± 0.05     | 1.079   |
|          | 0   | 16      | pp512   | 278.34 ± 1.04    | 620.40 ± 3.24    | 2.229   |
|          | 0   | 16      | pp2048  | 217.57 ± 0.70    | 562.58 ± 2.31    | 2.586   |
|          | 0   | 16      | pp8192  | 111.29 ± 0.15    | 414.44 ± 0.83    | 3.724   |
|          | 0   | 16      | pp32768 | 35.78 ± 0.00     | 199.58 ± 0.00    | 5.578   |
| Metal    | 100 | 8       | tg128   | 88.82 ± 0.19     | 91.06 ± 0.18     | 1.025   |
|          | 100 | 8       | pp512   | 1427.74 ± 1.44   | 1512.66 ± 0.59   | 1.059   |
|          | 100 | 8       | pp2048  | 1363.51 ± 0.62   | 1456.12 ± 0.73   | 1.068   |
|          | 100 | 8       | pp8192  | 1093.02 ± 0.86   | 1224.56 ± 0.52   | 1.120   |
|          | 100 | 8       | pp32768 | 572.65 ± 1.13    | 728.75 ± 5.56    | 1.272   |
| ARM_NEON | 0   | 8       | tg128   | 54.06 ± 0.15     | 62.49 ± 0.18     | 1.156   |
|          | 0   | 8       | pp512   | 148.92 ± 0.15    | 243.09 ± 0.06    | 1.632   |
|          | 0   | 8       | pp2048  | 130.66 ± 1.84    | 226.46 ± 5.41    | 1.733   |
|          | 0   | 8       | pp8192  | 97.95 ± 3.57     | 189.65 ± 4.30    | 1.936   |
For very large prompts (pp32768) the performance difference is striking, reaching 5.5X for `AVX2`!
### Flash attention
Flash attention is only useful on CUDA (on the other 3 platforms I have available, performance is lower with flash attention), so only CUDA results are shown:
40% faster for 32k tokens is quite nice.