Closed vasqu closed 1 month ago
you'll see in the Attn implementation that if num_heads and headdim are both provided (in this case 30 and 128), then the QKV projection will go from 2560 dims to 30 × 128 = 3840 for each of Q, K, V. This is 50% more heads than usual; the only reason for that is to keep the number of parameters of the attn layer around 6 d_model^2, about the same as the mamba layers, so that it's more uniform for comparison. You could have kept 20 heads with headdim=128 and increased the total number of layers to keep the size the same.
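A quick sanity check of that arithmetic (my own sketch, not code from the repo): with the Q, K, V projections plus the output projection, 30 heads of dim 128 lands exactly on the 6 d_model^2 target.

```python
d_model = 2560
num_heads = 30    # 50% more than the usual d_model / headdim = 20
headdim = 128

proj_dim = num_heads * headdim  # 3840, for each of Q, K, V

# Q, K, V projections plus the output projection (bias-free)
attn_params = 3 * d_model * proj_dim + proj_dim * d_model

print(attn_params)          # 39321600
print(6 * d_model ** 2)     # 39321600 -- matches exactly
```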
I like rope_emb_dim = 50% of head_dim (similar to GPTNeoX). Probably doesn't make a big difference.
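For illustration, here is a minimal sketch of partial rotary embeddings in that style (rotate only the first rope_emb_dim channels of each head, pass the rest through); this is my own simplified version, not the repo's implementation:

```python
import numpy as np

head_dim = 128
rope_dim = head_dim // 2  # rotary applied to 50% of the head dim

def apply_partial_rope(x, pos, theta=10000.0):
    """Rotate the first rope_dim channels of a single head vector x
    at position pos; leave the remaining channels untouched."""
    rot, rest = x[:rope_dim], x[rope_dim:]
    half = rope_dim // 2
    freqs = theta ** (-np.arange(half) / half)   # per-pair frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:half], rot[half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, rest])

q = np.ones(head_dim)
q_rot = apply_partial_rope(q, pos=5)
# the non-rotary half passes through unchanged
assert np.allclose(q_rot[rope_dim:], q[rope_dim:])
```

Since the rotation is norm-preserving on each 2D pair, the rotary half keeps its magnitude; only relative-position information is injected.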
Ah ok, that makes sense. Thanks for clearing this up for me :smile:
When inspecting the config of the hybrid model https://huggingface.co/state-spaces/mamba2attn-2.7b/blob/main/config.json, I came up with two questions:
Thanks in advance!