state-spaces / mamba

Mamba SSM architecture
Apache License 2.0

Questions regarding pretrained Mamba2-Attention Hybrid Model #457

Closed vasqu closed 1 month ago

vasqu commented 1 month ago

When inspecting the config of the hybrid model https://huggingface.co/state-spaces/mamba2attn-2.7b/blob/main/config.json, I came up with two questions:

1. Why does the attention config use num_heads=30 with headdim=128 (i.e. a QKV width of 3840) instead of matching d_model=2560 with the usual 20 heads?
2. Why is the rotary embedding dimension set to only 50% of head_dim?

Thanks in advance!

tridao commented 1 month ago

you'll see in the Attn implementation that if num_heads and headdim are both provided (in this case 30 and 128), then the QKV projection will go from 2560 dim to 30 × 128 = 3840 for each of Q, K, V. This is 50% more heads than usual; the only reason for that is to keep the number of parameters of the attn layer around 6 d_model^2, about the same as the mamba layers, so that it's more uniform for comparison. You could have kept 20 heads with headdim=128 and increased the total number of layers to keep the size the same.
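As a quick sanity check on the parameter-count argument, here is a minimal sketch (plain-Python arithmetic, not code from this repo; the dims are the ones quoted above): with d_model = 2560, 30 heads of dim 128 give an attention block of about 6 × d_model^2 parameters, while the usual 20 heads would give only 4 × d_model^2.

```python
def attn_params(d_model: int, num_heads: int, headdim: int) -> int:
    """Rough parameter count of a multi-head attention block (biases ignored)."""
    inner_dim = num_heads * headdim       # width of each of Q, K, V
    qkv = 3 * d_model * inner_dim         # in_proj: d_model -> 3 * inner_dim
    out = inner_dim * d_model             # out_proj: inner_dim -> d_model
    return qkv + out

d_model = 2560
print(attn_params(d_model, num_heads=30, headdim=128) / d_model**2)  # 6.0 (hybrid config)
print(attn_params(d_model, num_heads=20, headdim=128) / d_model**2)  # 4.0 ("usual" head count)
```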

I like rope_emb_dim = 50% of head_dim (similar to GPTNeoX). Probably doesn't make a big difference.
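For reference, a rough sketch of what a partial rotary dimension means in practice, using the GPT-NeoX-style rotate-half formulation and assumed shapes (this is not the repo's implementation): only the first rotary_emb_dim = head_dim // 2 channels of each head get rotated, the remaining channels pass through unchanged.

```python
import torch

def apply_partial_rope(x, cos, sin, rotary_emb_dim):
    # x: (batch, seqlen, num_heads, head_dim); rotate only the first rotary_emb_dim channels
    x_rot, x_pass = x[..., :rotary_emb_dim], x[..., rotary_emb_dim:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)

head_dim, rotary_emb_dim, seqlen = 128, 64, 16
inv_freq = 1.0 / 10000 ** (torch.arange(0, rotary_emb_dim, 2).float() / rotary_emb_dim)
freqs = torch.outer(torch.arange(seqlen).float(), inv_freq)   # (seqlen, rotary_emb_dim // 2)
cos, sin = freqs.cos()[None, :, None, :], freqs.sin()[None, :, None, :]

q = torch.randn(1, seqlen, 30, head_dim)
q_rope = apply_partial_rope(q, cos, sin, rotary_emb_dim)      # same shape as q
```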

vasqu commented 1 month ago

Ah ok, that makes sense. Thanks for clearing this up for me :smile: