Closed vasqu closed 1 month ago
you'll see in the Attn implementation that if num_heads and headdim are both provided (in this case 30 and 128), then the QKV projection will go from 2560 dims to 30 × 128 = 3840 for each of Q, K, V. This is 50% more heads than usual; the only reason for that is to keep the number of parameters of the attn layer around 6 d_model^2, about the same as the mamba layers, so that it's more uniform for comparison. You could have kept 20 heads with headdim=128 and increased the total number of layers to keep the size the same.
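A quick sanity check of that arithmetic (my own sketch, not code from the repo): with the Q, K, V projections plus the output projection, 30 heads of dim 128 lands exactly on the 6 d_model^2 target.

```python
d_model = 2560
num_heads = 30    # 50% more than the usual d_model / headdim = 20
headdim = 128

proj_dim = num_heads * headdim  # 3840, for each of Q, K, V

# Q, K, V projections plus the output projection (bias-free)
attn_params = 3 * d_model * proj_dim + proj_dim * d_model

print(attn_params)          # 39321600
print(6 * d_model ** 2)     # 39321600 -- matches exactly
```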
I like rope_emb_dim = 50% of head_dim (similar to GPTNeoX). Probably doesn't make a big difference.
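For illustration, here is a minimal sketch of partial rotary embeddings in that style (rotate only the first rope_emb_dim channels of each head, pass the rest through); this is my own simplified version, not the repo's implementation:

```python
import numpy as np

head_dim = 128
rope_dim = head_dim // 2  # rotary applied to 50% of the head dim

def apply_partial_rope(x, pos, theta=10000.0):
    """Rotate the first rope_dim channels of a single head vector x
    at position pos; leave the remaining channels untouched."""
    rot, rest = x[:rope_dim], x[rope_dim:]
    half = rope_dim // 2
    freqs = theta ** (-np.arange(half) / half)   # per-pair frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:half], rot[half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, rest])

q = np.ones(head_dim)
q_rot = apply_partial_rope(q, pos=5)
# the non-rotary half passes through unchanged
assert np.allclose(q_rot[rope_dim:], q[rope_dim:])
```

Since the rotation is norm-preserving on each 2D pair, the rotary half keeps its magnitude; only relative-position information is injected.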
Ah ok, that makes sense. Thanks for clearing this up for me :smile:
When inspecting the config of the hybrid model https://huggingface.co/state-spaces/mamba2attn-2.7b/blob/main/config.json, I came up with two questions:
Thanks in advance!