google / gemma.cpp

A lightweight, standalone C++ inference engine for Google's Gemma models.
Apache License 2.0
5.9k stars · 499 forks

Use more parallelism in the QKV projections of the MHA block. #176

Closed szabadka closed 4 months ago

szabadka commented 4 months ago

We now compute all three projections (Q, K and V) with a single MatVec and then copy the KV part of the result into the cache.

Benchmark results for 7b-it model that uses MHA blocks (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation):

                 Prefill speed               Generation speed
Num threads    BEFORE      AFTER           BEFORE      AFTER
32             13.75 t/s   14.80 t/s       9.22 t/s    9.77 t/s
64             19.89 t/s   24.83 t/s      12.46 t/s   13.66 t/s
szabadka commented 4 months ago

Great to get rid of MatVecLoop! Just to confirm: does the MQA 2B model still work?

Yes, I tested it and it still works.