ikawrakow / ik_llama.cpp

llama.cpp clone with additional SOTA quants and improved CPU performance
MIT License

Adding fused rms_norm #42

Closed · ikawrakow closed this 1 week ago

ikawrakow commented 1 week ago

Many models have one or more rms_norm operations followed by a multiplication with a normalization tensor that is (almost) always just a single row. Fusing these two operations into a single op reduces thread-synchronization cost and thus has the potential to improve performance, especially for relatively small models.
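For illustration, here is a minimal per-row sketch of what such a fusion looks like (the function name and signature are hypothetical, not the PR's actual ggml code): the normalization factor and the element-wise weight multiplication are applied together, so the graph needs one op, and one thread synchronization, instead of two.

```c
#include <math.h>
#include <stddef.h>

// Hypothetical fused rms_norm + mul over a single row.
// x: input row, w: normalization weights (one row), y: output row.
static void fused_rms_norm_row(const float * x, const float * w,
                               float * y, size_t n, float eps) {
    // First pass: mean of squares of the input row.
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * x[i];
    }
    const float scale = 1.0f / sqrtf(sum / (float) n + eps);

    // Second pass: normalize and apply the weight in one step,
    // rather than a separate rms_norm op followed by a mul op.
    for (size_t i = 0; i < n; ++i) {
        y[i] = scale * x[i] * w[i];
    }
}
```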

This PR adds the fused operation, with implementations for CPU, CUDA, and Metal. We get about a 1% speedup in prompt processing (PP) and token generation (TG) for Gemma2-2b on all implemented platforms. For a tiny model such as the 99M-parameter ternary TriLM, the performance improvement is in the range of 5-7%.