Hi, thanks for the good paper and for releasing the code.
I'm reading both the paper and code, and I've got questions regarding the case of multi-head attention.
It seems that the code exactly follows the initialization procedure explained in Sec 3.2 (i.e., initialize weights by Xavier and scale v by 0.67 * N**(-1/4) or (9*N)**(-1/4)), even for multi-head attention. But when v (v_proj.weight) is defined with shape (embed_dim, embed_dim), xavier_uniform_ initializes it from a uniform distribution U(-a, a) with a = sqrt(6/(fan_in + fan_out)) = sqrt(6/(embed_dim + embed_dim)). In multi-head attention, however, v is actually used as multiple (num_heads) matrices of shape (head_dim, embed_dim). In that case, shouldn't v be initialized from U(-a, a) with a = sqrt(6/(embed_dim + head_dim))? When num_heads=8, this initialization increases the scale of v's weights by a factor of 4/3 (= sqrt(2/(1 + 1/num_heads))).
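For concreteness, here is a minimal sketch of the discrepancy I mean, assuming PyTorch's nn.init.xavier_uniform_ and illustrative values embed_dim=512, num_heads=8 (the numbers are only for the example):

```python
import math
import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
head_dim = embed_dim // num_heads  # 64

# Current behaviour: Xavier on the full (embed_dim, embed_dim) v_proj matrix.
v_proj = torch.empty(embed_dim, embed_dim)
nn.init.xavier_uniform_(v_proj)
bound_full = math.sqrt(6.0 / (embed_dim + embed_dim))      # a = sqrt(6 / (2 * embed_dim))

# Per-head view: num_heads matrices of shape (head_dim, embed_dim).
bound_per_head = math.sqrt(6.0 / (embed_dim + head_dim))   # a = sqrt(6 / (embed_dim + head_dim))

# The per-head bound is larger by sqrt(2 / (1 + 1/num_heads)) = 4/3 for num_heads=8.
print(bound_per_head / bound_full)  # ~1.333
```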
Other questions:
In the paper, I assumed that d and d' in eq 5 correspond to d_model and d_k, respectively, in the original Transformer paper, and to embed_dim and head_dim in the code. Is that correct? If so, should the 1/sqrt(d) in eq 5 perhaps be 1/sqrt(d')?
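To make the notation question concrete, here is my reading of the dimensions (a hedged sketch; embed_dim=512 and num_heads=8 are only example values):

```python
embed_dim = 512                    # d_model in the original Transformer; d in eq 5 (my reading)
num_heads = 8
head_dim = embed_dim // num_heads  # d_k in the original Transformer; d' in eq 5 (my reading)

# The original Transformer scales attention logits by 1/sqrt(d_k), i.e. per-head,
# so my question is whether eq 5's 1/sqrt(d) should read 1/sqrt(d') = 1/sqrt(head_dim).
print(1 / head_dim ** 0.5)   # 0.125  (per-head scaling)
print(1 / embed_dim ** 0.5)  # ~0.044 (what a literal reading of 1/sqrt(d) would give)
```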
In the code, TransformerEncoderLayer.self_attn.v_proj.weight is scaled as (0.67 * (en_layers) ** (- 1. / 4.)) * (param * (2**0.5)) for the encoder and (9 * de_layers) ** (- 1. / 4.) * (param * (2**0.5)) for the decoder. I assumed that the extra factor of * (2**0.5) is there to cancel the gain=1/math.sqrt(2) option used in xavier_uniform_ when initializing v_proj.weight. Is that correct?
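A minimal sketch of what I assume is happening (using PyTorch's xavier_uniform_; embed_dim=512 and en_layers=6 are only example values):

```python
import math
import torch
import torch.nn as nn

embed_dim, en_layers = 512, 6

# Initialize v_proj.weight with Xavier and gain = 1/sqrt(2), as in the code ...
param = torch.empty(embed_dim, embed_dim)
nn.init.xavier_uniform_(param, gain=1 / math.sqrt(2))

# ... then apply the Sec 3.2 scaling; the extra * 2**0.5 undoes the 1/sqrt(2) gain,
# so the net effect is plain Xavier scaled by 0.67 * N**(-1/4) (encoder case).
scaled = (0.67 * en_layers ** (-1. / 4.)) * (param * 2 ** 0.5)
```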
Best,
Tatsunori