google / gemma.cpp

lightweight, standalone C++ inference engine for Google's Gemma models.
Apache License 2.0

Use more parallelism in attention block in prefill mode. #177

Closed by szabadka 4 months ago

szabadka commented 4 months ago

Move the loop over the tokens inside the attention block and then create kHeads * num_tokens threads.

This improves multi-threaded speed only for the 2B Gemma model, but for consistency we also move the loop over the tokens inside the Griffin recurrent layer and the FFW layer. This is also a preparation for using the MatMul operation later.
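For illustration, here is a minimal, self-contained sketch of the flattened parallelization. It is not the actual gemma.cpp code, which uses Highway's thread pool and the real attention kernels; `kHeads` and `num_tokens` come from the description above, while `AttendOneHead` and `ParallelFor` are hypothetical placeholders.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

constexpr size_t kHeads = 8;  // illustrative head count

// Placeholder for the per-head, per-token attention work.
void AttendOneHead(size_t head, size_t token) {
  (void)head;
  (void)token;
}

// Stand-in for a thread pool: runs func(task) for every task in [0, num_tasks).
template <typename Func>
void ParallelFor(size_t num_tasks, const Func& func) {
  size_t num_workers = std::thread::hardware_concurrency();
  if (num_workers == 0) num_workers = 1;
  num_workers = std::min(num_workers, num_tasks);
  std::vector<std::thread> workers;
  for (size_t w = 0; w < num_workers; ++w) {
    workers.emplace_back([&, w] {
      for (size_t task = w; task < num_tasks; task += num_workers) {
        func(task);
      }
    });
  }
  for (auto& t : workers) t.join();
}

// BEFORE: the token loop stays outside the attention block, so each pass
// exposes only kHeads parallel tasks.
void PrefillBefore(size_t num_tokens) {
  for (size_t token = 0; token < num_tokens; ++token) {
    ParallelFor(kHeads, [&](size_t head) { AttendOneHead(head, token); });
  }
}

// AFTER: the token loop moves inside the attention block and is flattened
// together with the head loop, exposing kHeads * num_tokens tasks at once.
void PrefillAfter(size_t num_tokens) {
  ParallelFor(kHeads * num_tokens, [&](size_t task) {
    AttendOneHead(/*head=*/task % kHeads, /*token=*/task / kHeads);
  });
}

int main() {
  PrefillBefore(16);
  PrefillAfter(16);
  return 0;
}
```

The key change is replacing a per-token parallel loop over heads with a single parallel loop over `kHeads * num_tokens`, which keeps more workers busy during prefill, presumably explaining why the smaller 2B model benefits most at 32 and 64 threads.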

Benchmark results (summarization with a 1600-token prefill and essay writing with 500 generated tokens):

| Num threads | Prefill speed BEFORE | Prefill speed AFTER |
|------------:|---------------------:|--------------------:|
| 32          | 61.76 t/s            | 65.08 t/s           |
| 64          | 89.46 t/s            | 98.62 t/s           |