joyjitkundu032 opened 7 months ago
Same question, any replies?
You are right, Llama2-70B uses grouped-query attention, so the Q/K/V/O parameters are not 4h^2. Q and O are still 2h^2, but K and V together are only (kv_heads/attn_heads) * 2h^2. So the total attention parameters per layer are 2h^2 + (kv_heads/attn_heads) * 2h^2.
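A minimal sketch of that formula, assuming the published 70B config (hidden size 8192, 64 attention heads, 8 KV heads):

```python
# Per-layer attention parameter count with grouped-query attention (GQA).
# W_q and W_o stay full h x h; W_k and W_v shrink by kv_heads/attn_heads.
def attn_params_per_layer(h, attn_heads, kv_heads):
    qo = 2 * h * h                            # W_q + W_o
    kv = (kv_heads / attn_heads) * 2 * h * h  # W_k + W_v (grouped)
    return qo + kv

# Assumed Llama2-70B values: hidden 8192, 64 attention heads, 8 KV heads
print(attn_params_per_layer(8192, 64, 8))  # 2.25 * h^2 ~ 151M, not 4 * h^2 ~ 268M
```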
Hi All,
I am struggling to arrive at a count of 70B parameters for the Llama2-70B model. Here is my calculation:
- Attention parameters per layer: 4 x 8192 x 8192
- MLP parameters per layer (gate, up, and down projections): 3 x 8192 x 28672
- 80 layers, vocab size 32000 (embedding dim 8192)

Total parameters ~ 80 x (4 x 8192 x 8192 + 3 x 8192 x 28672) + 32000 x 8192 ~ 78B

Where am I going wrong?
I do get the correct count for 13B: Total parameters ~ 40 x (4 x 5120 x 5120 + 3 x 5120 x 13824) + 32000 x 5120 ~ 12.7B
Is it because of grouped-query attention in the 70B model?
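For reference, a quick sketch that reproduces both totals. It assumes the published 70B config (64 attention heads, 8 KV heads), assumes the input and output embeddings are untied, and ignores the small RMSNorm weights:

```python
# Sketch: reproduce the ~78B total above and the GQA-corrected total.
def total_params(h, ffn, layers, vocab, attn_heads, kv_heads, untied_lm_head):
    attn = 2 * h * h + (kv_heads / attn_heads) * 2 * h * h  # Q/O + grouped K/V
    mlp = 3 * h * ffn                                        # gate, up, down
    emb = vocab * h * (2 if untied_lm_head else 1)           # embedding (+ LM head)
    return layers * (attn + mlp) + emb

# Full multi-head attention, single embedding matrix: ~78.1B (the count above)
print(total_params(8192, 28672, 80, 32000, 64, 64, untied_lm_head=False) / 1e9)
# Grouped-query attention (8 KV heads) plus the untied LM head: ~69.0B
print(total_params(8192, 28672, 80, 32000, 64, 8, untied_lm_head=True) / 1e9)
```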