google / gemma.cpp

lightweight, standalone C++ inference engine for Google's Gemma models.
Apache License 2.0
5.76k stars, 487 forks

Simplify Attention. #258

Closed. copybara-service[bot] closed this 2 weeks ago

copybara-service[bot] commented 2 weeks ago

Simplify Attention.

Shared kMHA, reuse from Activations, inline Attn lambda, use QDim as the stride between successive Q vectors.
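
To illustrate what "stride between successive Q" can mean, here is a minimal sketch rather than the code from this change. It assumes a flat per-token buffer in which, under multi-head attention, each head's Q, K and V are stored back to back, so consecutive Q vectors are 3 * kQKVDim floats apart; otherwise only Q is stored and the stride is kQKVDim. The names `kHeads`, `kQKVDim`, `QStride` and `HeadQ`, the dimensions, and the layout itself are hypothetical and are not taken from gemma.cpp; how they map onto the QDim mentioned above is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dimensions, chosen only for illustration.
constexpr std::size_t kHeads = 4;
constexpr std::size_t kQKVDim = 8;

// Distance (in floats) from the start of one head's Q to the next head's Q.
// Assumption: with MHA, each head stores Q, K and V contiguously, so the
// stride is 3 * kQKVDim; without MHA only Q is stored per head.
constexpr std::size_t QStride(bool is_mha) {
  return is_mha ? 3 * kQKVDim : kQKVDim;
}

// Returns a pointer to the Q vector of `head` inside the flat buffer `buf`.
inline const float* HeadQ(const std::vector<float>& buf, std::size_t head,
                          bool is_mha) {
  const std::size_t offset = head * QStride(is_mha);
  assert(head < kHeads);
  assert(offset + kQKVDim <= buf.size());  // Q of this head must fit.
  return buf.data() + offset;
}
```

Computing the stride in one place means the per-head loop can index Q the same way in both layouts, which is one plausible reason to express the access as "head times stride".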