tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License
2.09k stars · 140 forks

Does llama2 use grouped-query attention or multi-head attention? #23

Closed · magician-blue closed this 12 months ago

magician-blue commented 12 months ago

I remember that Llama 2 uses grouped-query attention. In llama2.c, I found fields like kv_heads and kv_dim.
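
(For context: in grouped-query attention the key/value projections have fewer heads than the queries, so kv_dim is smaller than dim. A minimal sketch of the arithmetic, using illustrative Llama-2-70B-style numbers rather than values read from this repo:)

```c
#include <stdio.h>

int main(void) {
    // Illustrative Llama-2-70B-style config (example numbers, not from this repo):
    int dim        = 8192;  // transformer width
    int n_heads    = 64;    // query heads
    int n_kv_heads = 8;     // key/value heads; n_kv_heads == n_heads is plain MHA
    int head_size  = dim / n_heads;                // 128
    int kv_dim     = (dim * n_kv_heads) / n_heads; // 1024: the kv_dim seen in run.c
    printf("head_size=%d kv_dim=%d dim=%d\n", head_size, kv_dim, dim);
    printf("query heads sharing each kv head: %d\n", n_heads / n_kv_heads);
    return 0;
}
```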

tic-top commented 12 months ago

In run.c, I found something like this:

```c
// qkv matmuls for this position
matmul(s->q, s->xb, w->wq + l*dim*dim, dim, dim);
matmul(s->k, s->xb, w->wk + l*dim*kv_dim, dim, kv_dim);
matmul(s->v, s->xb, w->wv + l*dim*kv_dim, dim, kv_dim);
```
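
(Note that wk and wv project down to kv_dim rather than dim, which is the grouped-query path. The other half of GQA is the attention loop, where several query heads read the same cached key/value head; run.c does the equivalent via a kv_mul = n_heads / n_kv_heads factor. A simplified sketch of that index math, where the function name and the [seq_len][kv_dim] cache layout are assumptions for illustration, not verbatim run.c:)

```c
#include <stddef.h>

// Map query head h to the shared key vector it attends to at timestep t,
// given one layer's key cache laid out as [seq_len][kv_dim].
static const float* key_for_query_head(const float* layer_key_cache,
                                       int h, int t,
                                       int n_heads, int n_kv_heads,
                                       int head_size) {
    int kv_dim  = n_kv_heads * head_size;
    int kv_mul  = n_heads / n_kv_heads;  // query heads per shared KV head
    int kv_head = h / kv_mul;            // MHA is the special case kv_mul == 1
    return layer_key_cache + (size_t)t * kv_dim + (size_t)kv_head * head_size;
}
```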

tairov commented 12 months ago

Thanks for your question. I found this issue in the original llama repo quite interesting. We probably need to revise and adapt our implementation. @magician-blue, did you implement a similar approach in your recent PR?

magician-blue commented 12 months ago

#24