google / gemma.cpp

lightweight, standalone C++ inference engine for Google's Gemma models.
Apache License 2.0

Use more parallelism in attention block in prefill mode. #177

Closed by szabadka 4 months ago

szabadka commented 4 months ago

Move the loop over the tokens inside the attention block and then create kHeads * num_tokens threads.

This improves multi-threaded speed only for the 2B Gemma model, but for consistency we also move the loop over the tokens inside the Griffin recurrent layer and the FFW layer. This is also a preparation for using the MatMul operation later.
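For illustration, here is a minimal, self-contained sketch of the flattened parallelization. It is not the actual gemma.cpp code, which uses Highway's thread pool and the real attention kernels; `kHeads` and `num_tokens` come from the description above, while `AttendOneHead` and `ParallelFor` are hypothetical placeholders.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

constexpr size_t kHeads = 8;  // illustrative head count

// Placeholder for the per-head, per-token attention work.
void AttendOneHead(size_t head, size_t token) {
  (void)head;
  (void)token;
}

// Stand-in for a thread pool: runs func(task) for every task in [0, num_tasks).
template <typename Func>
void ParallelFor(size_t num_tasks, const Func& func) {
  size_t num_workers = std::thread::hardware_concurrency();
  if (num_workers == 0) num_workers = 1;
  num_workers = std::min(num_workers, num_tasks);
  std::vector<std::thread> workers;
  for (size_t w = 0; w < num_workers; ++w) {
    workers.emplace_back([&, w] {
      for (size_t task = w; task < num_tasks; task += num_workers) {
        func(task);
      }
    });
  }
  for (auto& t : workers) t.join();
}

// BEFORE: the token loop stays outside the attention block, so each pass
// exposes only kHeads parallel tasks.
void PrefillBefore(size_t num_tokens) {
  for (size_t token = 0; token < num_tokens; ++token) {
    ParallelFor(kHeads, [&](size_t head) { AttendOneHead(head, token); });
  }
}

// AFTER: the token loop moves inside the attention block and is flattened
// together with the head loop, exposing kHeads * num_tokens tasks at once.
void PrefillAfter(size_t num_tokens) {
  ParallelFor(kHeads * num_tokens, [&](size_t task) {
    AttendOneHead(/*head=*/task % kHeads, /*token=*/task / kHeads);
  });
}

int main() {
  PrefillBefore(16);
  PrefillAfter(16);
  return 0;
}
```

The key change is replacing a per-token parallel loop over heads with a single parallel loop over `kHeads * num_tokens`, which keeps more workers busy during prefill, presumably explaining why the smaller 2B model benefits most at 32 and 64 threads.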

Benchmark results (summarization with a 1600-token prefill and essay writing with 500 generated tokens):

| Num threads | Prefill speed BEFORE | Prefill speed AFTER |
|------------:|---------------------:|--------------------:|
| 32          | 61.76 t/s            | 65.08 t/s           |
| 64          | 89.46 t/s            | 98.62 t/s           |