joyjitkundu032 opened 7 months ago
Same question, any replies?
You are right, Llama2-70B uses grouped-query attention, so the Q/K/V/O parameters are not 4h^2. Q and O are still 2h^2, but K and V together are only (kv_heads/attn_heads) * 2h^2. So the total attention parameters per layer are 2h^2 + (kv_heads/attn_heads) * 2h^2.
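A minimal sketch of that formula, assuming the published 70B config (hidden size 8192, 64 attention heads, 8 KV heads):

```python
# Per-layer attention parameter count with grouped-query attention (GQA).
# W_q and W_o stay full h x h; W_k and W_v shrink by kv_heads/attn_heads.
def attn_params_per_layer(h, attn_heads, kv_heads):
    qo = 2 * h * h                            # W_q + W_o
    kv = (kv_heads / attn_heads) * 2 * h * h  # W_k + W_v (grouped)
    return qo + kv

# Assumed Llama2-70B values: hidden 8192, 64 attention heads, 8 KV heads
print(attn_params_per_layer(8192, 64, 8))  # 2.25 * h^2 ~ 151M, not 4 * h^2 ~ 268M
```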
Hi All,
I am struggling to arrive at a count of 70B parameters for the Llama2-70B model. Here is my calculation:
- Attention parameters per layer: 4 x 8192 x 8192
- MLP parameters per layer (gate, up, and down projections): 3 x 8192 x 28672
- 80 layers, vocab size 32000 (embedding dim 8192)

Total parameters ~ 80 x (4 x 8192 x 8192 + 3 x 8192 x 28672) + 32000 x 8192 ~ 78B

Where am I going wrong?
I do get the correct count for 13B: Total parameters ~ 40 x (4 x 5120 x 5120 + 3 x 5120 x 13824) + 32000 x 5120 ~ 12.7B
Is it because of grouped-query attention in the 70B model?
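For reference, a quick sketch that reproduces both totals. It assumes the published 70B config (64 attention heads, 8 KV heads), assumes the input and output embeddings are untied, and ignores the small RMSNorm weights:

```python
# Sketch: reproduce the ~78B total above and the GQA-corrected total.
def total_params(h, ffn, layers, vocab, attn_heads, kv_heads, untied_lm_head):
    attn = 2 * h * h + (kv_heads / attn_heads) * 2 * h * h  # Q/O + grouped K/V
    mlp = 3 * h * ffn                                        # gate, up, down
    emb = vocab * h * (2 if untied_lm_head else 1)           # embedding (+ LM head)
    return layers * (attn + mlp) + emb

# Full multi-head attention, single embedding matrix: ~78.1B (the count above)
print(total_params(8192, 28672, 80, 32000, 64, 64, untied_lm_head=False) / 1e9)
# Grouped-query attention (8 KV heads) plus the untied LM head: ~69.0B
print(total_params(8192, 28672, 80, 32000, 64, 8, untied_lm_head=True) / 1e9)
```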