Some models apply a so-called "soft cap" in their attention portion, and some also apply a soft cap to the final output. This is currently implemented as
```
x = ggml_scale(x, 1/softcap_parameter)
x = ggml_tanh(x)
x = ggml_scale(x, softcap_parameter)
```
By fusing these 3 operations into a single kernel, we gain about 1% on all tested backends (`AVX2`, `NEON`, `CUDA`, `Metal`).
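To make the fusion concrete, here is a minimal scalar sketch of the two variants (illustrative only, not the actual backend kernels, which are SIMD-ified per architecture):

```c
#include <math.h>
#include <stddef.h>

// Unfused: three separate ops -> the data is read and written three times.
static void softcap_unfused(float * x, size_t n, float s) {
    for (size_t i = 0; i < n; ++i) x[i] *= 1.0f/s;      // ggml_scale
    for (size_t i = 0; i < n; ++i) x[i] = tanhf(x[i]);  // ggml_tanh
    for (size_t i = 0; i < n; ++i) x[i] *= s;           // ggml_scale
}

// Fused: one pass computing s * tanh(x/s) -> one read and one write per element.
static void softcap_fused(float * x, size_t n, float s) {
    for (size_t i = 0; i < n; ++i) x[i] = s * tanhf(x[i]/s);
}
```

The result is identical; the fused version simply touches each element once instead of three times.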
Also added is a SIMD-ified implementation of GeLU (`AVX512`, `AVX2`, `NEON`). This gives another ~1% performance gain on `AVX512`/`AVX2`. The `ggml` GeLU lookup table is faster on my M2-Max CPU, so that is what is used on `NEON`.
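For reference, the function all of these variants evaluate is the usual tanh-based GeLU approximation; a scalar sketch (not the actual ggml code):

```c
#include <math.h>

// Scalar reference for the tanh-based GeLU approximation.
// The SIMD and lookup-table variants differ in how they compute it, not in the result.
static inline float gelu_ref(float x) {
    const float SQRT_2_OVER_PI = 0.7978845608028654f;
    const float GELU_COEF_A    = 0.044715f;
    return 0.5f*x*(1.0f + tanhf(SQRT_2_OVER_PI*x*(1.0f + GELU_COEF_A*x*x)));
}
```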
The above is based on just checking `PP-512` and `TG-128` performance. But the soft cap is used in the attention portion of Gemma-2 models, so let's look at a large context where self-attention plays a more significant role. I'll use Gemma-2-9b and a context of 8192 tokens, but instead of comparing to the main branch in this repository I'll compare against the current mainline `llama.cpp` version. The following table compares `PP-8192` performance for `AVX2` (Ryzen-7950X), `CUDA` (RTX-4080), `ARM_NEON` (M2-Max CPU), and `Metal` (30-core M2-Max GPU). To keep the table small, results are given just for the `Q4_K_S` quantization.
| backend | test   | t/s (llama.cpp) | t/s (this PR)  | Speedup |
|---------|--------|-----------------|----------------|---------|
| AVX2    | pp8192 | 32.90 ± 0.00    | 103.16 ± 0.00  | 3.136   |
| CUDA    | pp8192 | 2495.19 ± 1.20  | 3068.44 ± 0.68 | 1.230   |
| NEON    | pp8192 | 26.44 ± 0.00    | 48.30 ± 0.00   | 1.827   |
| Metal   | pp8192 | 294.33 ± 0.40   | 325.78 ± 1.94  | 1.107   |
As I have not changed much in the `CUDA` and `Metal` back-ends, the 23% (`CUDA`) and 10% (`Metal`) performance differences come from this one fused operation! On `AVX2` the performance gap has grown to 3.136X, up from the 1.874X we had from the improved matrix multiplications alone (see the 1st table on the main page). On `ARM_NEON` this implementation is now 1.827X faster, up from 1.639X. I think the much larger increase in relative performance on the Ryzen-7950X can be explained by its less capable memory subsystem: for a context of 8192 tokens the `K*Q` tensor to which the soft cap is applied no longer fits in the cache, so the `ggml_scale + ggml_tanh + ggml_scale` implementation in `llama.cpp` requires it to be loaded from and stored to main memory 3 times, instead of just once when the 3 operations are fused into a single op.
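A rough back-of-envelope version of that argument, treating the `K*Q` tensor as N fp32 elements (N is a placeholder here; the exact size depends on how the attention computation is chunked): the unfused path costs 3 reads + 3 writes ≈ 24·N bytes of main-memory traffic, while the fused op costs 1 read + 1 write ≈ 8·N bytes, i.e. 3X less traffic for this op once the tensor no longer fits in cache.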