ikawrakow / ik_llama.cpp

llama.cpp clone with additional SOTA quants and improved CPU performance

Faster Gemma2 #27

Closed · ikawrakow closed this 3 weeks ago

ikawrakow commented 3 weeks ago

In a previous PR I had fused the scale - tanh - scale sequence used for "soft-capping" activations into a GGML_OP_SOFTCAP operation. This PR further fuses GGML_OP_SOFTCAP with GGML_OP_SOFT_MAX into a new GGML_OP_SOFT_CAP_MAX operation. This is useful, e.g., for the self-attention in the Gemma-2 series of models, and leads to a significant performance increase.
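
For reference, what the fused op computes per row of the KQ matrix is simply the soft-cap followed by a softmax, done in one pass over the data instead of materializing the capped scores as a separate tensor between two ops. Below is a minimal standalone C++ sketch of that per-row computation; it is illustrative only, not the actual ggml kernel, and the function name and the way the two scales are passed are my assumptions:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One row of attention scores: soft-capping (scale -> tanh -> scale)
// followed by softmax. Illustrative sketch only -- not the ggml kernel;
// the exact parameterization of the two scales is an assumption.
static void soft_cap_max_row(std::vector<float> & row, float s_pre, float s_post) {
    float max_val = -INFINITY;
    for (float & x : row) {
        x = s_post * std::tanh(s_pre * x);  // soft-cap: y = s_post * tanh(s_pre * x)
        max_val = std::max(max_val, x);
    }
    float sum = 0.0f;
    for (float & x : row) {
        x = std::exp(x - max_val);          // softmax numerator
        sum += x;
    }
    for (float & x : row) {
        x /= sum;                           // normalize
    }
}
```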

In addition, "soft-capping" is added to flash attention. I see this has also been done in mainline llama.cpp in PR-8542 and PR-9159.
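
With flash attention the same capping simply moves inside the kernel: each KQ score is capped right after the Q·K dot product, before it enters the online softmax. A greatly simplified scalar sketch of where that step sits (illustrative, not the actual CUDA/Metal kernels; names and structure are mine):

```cpp
#include <algorithm>
#include <cmath>

// Simplified scalar flash-attention inner loop for a single query row,
// with soft-capping applied to each score before the online softmax.
static void fa_row_with_softcap(const float * q, const float * k, const float * v,
                                int n_kv, int head_dim, float kq_scale, float cap,
                                float * out) {
    float running_max = -INFINITY;
    float running_sum = 0.0f;
    for (int d = 0; d < head_dim; ++d) out[d] = 0.0f;

    for (int j = 0; j < n_kv; ++j) {
        float s = 0.0f;
        for (int d = 0; d < head_dim; ++d) s += q[d]*k[j*head_dim + d];
        s *= kq_scale;
        if (cap > 0.0f) s = cap * std::tanh(s/cap);  // soft-capping, as in Gemma-2

        // online softmax update: rescale previous accumulators to the new max
        const float new_max = std::max(running_max, s);
        const float corr    = std::exp(running_max - new_max);
        const float p       = std::exp(s - new_max);

        running_sum = running_sum*corr + p;
        for (int d = 0; d < head_dim; ++d) out[d] = out[d]*corr + p*v[j*head_dim + d];
        running_max = new_max;
    }
    for (int d = 0; d < head_dim; ++d) out[d] /= running_sum;
}
```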

Here are some performance comparisons to llama.cpp (build 3631) for Gemma-2-2b on CUDA (RTX-4080), Metal (30-core M2-Max GPU), AVX2 (Ryzen-7950X) and ARM_NEON (M2-Max CPU). The model is quantized with Q4_K_S (the performance gap between this repo and mainline llama.cpp is smaller for this quantization type than for most other quants).

No Flash attention

| backend | ngl | threads | test | t/s (llama.cpp) | t/s (PR) | Speedup |
|----------|----:|--------:|---------|------------------:|------------------:|--------:|
| CUDA | 100 | 1 | tg128 | 239.20 ± 0.27 | 244.47 ± 0.42 | 1.022 |
| CUDA | 100 | 1 | pp512 | 18413.90 ± 566 | 18824.91 ± 480 | 1.022 |
| CUDA | 100 | 1 | pp2048 | 17827.18 ± 106 | 18307.66 ± 77 | 1.027 |
| CUDA | 100 | 1 | pp8192 | 8814.67 ± 7.27 | 11673.96 ± 8.07 | 1.324 |
| CUDA | 100 | 1 | pp32768 | 2827.13 ± 12.12 | 4634.12 ± 4.84 | 1.639 |
| AVX2 | 0 | 4 | tg128 | 32.68 ± 0.08 | 35.26 ± 0.05 | 1.079 |
| AVX2 | 0 | 16 | pp512 | 278.34 ± 1.04 | 620.40 ± 3.24 | 2.229 |
| AVX2 | 0 | 16 | pp2048 | 217.57 ± 0.70 | 562.58 ± 2.31 | 2.586 |
| AVX2 | 0 | 16 | pp8192 | 111.29 ± 0.15 | 414.44 ± 0.83 | 3.724 |
| AVX2 | 0 | 16 | pp32768 | 35.78 ± 0.00 | 199.58 ± 0.00 | 5.578 |
| Metal | 100 | 8 | tg128 | 88.82 ± 0.19 | 91.06 ± 0.18 | 1.025 |
| Metal | 100 | 8 | pp512 | 1427.74 ± 1.44 | 1512.66 ± 0.59 | 1.059 |
| Metal | 100 | 8 | pp2048 | 1363.51 ± 0.62 | 1456.12 ± 0.73 | 1.068 |
| Metal | 100 | 8 | pp8192 | 1093.02 ± 0.86 | 1224.56 ± 0.52 | 1.120 |
| Metal | 100 | 8 | pp32768 | 572.65 ± 1.13 | 728.75 ± 5.56 | 1.272 |
| ARM_NEON | 0 | 8 | tg128 | 54.06 ± 0.15 | 62.49 ± 0.18 | 1.156 |
| ARM_NEON | 0 | 8 | pp512 | 148.92 ± 0.15 | 243.09 ± 0.06 | 1.632 |
| ARM_NEON | 0 | 8 | pp2048 | 130.66 ± 1.84 | 226.46 ± 5.41 | 1.733 |
| ARM_NEON | 0 | 8 | pp8192 | 97.95 ± 3.57 | 189.65 ± 4.30 | 1.936 |

For very large prompts (pp32768) the performance difference is striking, reaching 5.5X for AVX2!

Flash attention

Flash attention is only useful on CUDA (on the other three platforms available to me, performance is lower with flash attention enabled), so only CUDA results are shown here:

| backend | ngl | threads | fa | test | t/s (llama.cpp) | t/s (PR) | Speedup |
|---------|----:|--------:|---:|---------|------------------:|------------------:|--------:|
| CUDA | 100 | 1 | 1 | tg128 | 251.86 ± 0.56 | 256.15 ± 0.76 | 1.017 |
| CUDA | 100 | 1 | 1 | pp512 | 19127.14 ± 529.58 | 19712.11 ± 167.06 | 1.031 |
| CUDA | 100 | 1 | 1 | pp2048 | 18641.99 ± 72.13 | 19823.18 ± 91.26 | 1.063 |
| CUDA | 100 | 1 | 1 | pp8192 | 13566.85 ± 111.75 | 16108.68 ± 30.32 | 1.187 |
| CUDA | 100 | 1 | 1 | pp32768 | 6472.16 ± 4.43 | 9053.46 ± 9.68 | 1.399 |

40% faster for 32k tokens is quite nice.