I am struggling to arrive at the 70B parameter count for the Llama2-70B model. Here is my calculation:
Attention parameters per layer: 4 x 8192 x 8192
MLP parameters per layer (gate, up, and down projections): 3 x 8192 x 28672
80 layers, vocab size 32000 (embedding dim 8192)
Total parameters ~ 80 x (4 x 8192 x 8192 + 3 x 8192 x 28672) + 32000 x 8192 ~ 78B
Where am I getting it wrong?
I do get the correct count for 13B:
Total parameters ~ 40 x (4 x 5120 x 5120 + 3 x 5120 x 13824) + 32000 x 5120 ~ 12.7B
Is it because of grouped-query attention in the 70B model?
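To check that hypothesis numerically, here is a rough sketch of the count under both attention variants. It assumes the released Llama2-70B config values (64 query heads, `num_key_value_heads = 8`, so K/V project to 8 x 128 = 1024 dims instead of 8192), ignores the small RMSNorm weights, and counts the untied LM head separately from the input embedding:

```python
def llama_params(d_model, n_layers, d_ff, vocab, n_heads, n_kv_heads):
    """Rough decoder-only parameter count (ignores norm weights and biases)."""
    d_head = d_model // n_heads
    attn = d_model * d_model                      # Q projection
    attn += d_model * d_model                     # output projection
    attn += 2 * d_model * (n_kv_heads * d_head)   # K and V projections (shrunk under GQA)
    mlp = 3 * d_model * d_ff                      # gate, up, down projections
    # Llama 2 does not tie embeddings: count input embedding + LM head
    return n_layers * (attn + mlp) + 2 * vocab * d_model

# Full multi-head attention (my original assumption: n_kv_heads == n_heads)
mha = llama_params(8192, 80, 28672, 32000, 64, 64)
# Grouped-query attention: num_key_value_heads = 8 per the released config
gqa = llama_params(8192, 80, 28672, 32000, 64, 8)
print(f"MHA: {mha / 1e9:.1f}B, GQA: {gqa / 1e9:.1f}B")
```

With full MHA this reproduces my ~78B figure, and with the GQA head counts it lands at ~69B, which matches the advertised 70B once the norm weights are added back, so the discrepancy does seem to come entirely from the shared K/V projections.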