erlebach opened 4 months ago:
Same question, regarding the original code in the class MultiHeadAttention in mha.py. Because of the following logic, the softmax appears to operate across the batch dimension, which I don't understand. Need help.
# the definition of softmax
self.softmax = nn.Softmax(dim=1)
# the usage of softmax
attn = self.softmax(scores)  # here scores has a shape of [seq_q, seq_k, heads, d_k]
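As a sanity check on which axis dim=1 actually refers to, here is a minimal, self-contained sketch. The layout [seq_q, seq_k, batch_size, heads] and the sizes below are assumptions for illustration only, not a statement about what mha.py produces; the point is that nn.Softmax(dim=1) normalizes along the second axis (the key positions) independently for every other index, not across the batch.

import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for this sketch.
seq_q, seq_k, batch_size, heads = 5, 7, 3, 2

# Assumed layout: [seq_q, seq_k, batch_size, heads], so dim=1 is the key-position axis.
scores = torch.randn(seq_q, seq_k, batch_size, heads)

softmax = nn.Softmax(dim=1)
attn = softmax(scores)

# For every fixed (query position, batch element, head), the weights over key
# positions sum to 1; nothing is mixed across the batch axis.
print(attn.sum(dim=1).shape)  # torch.Size([5, 3, 2])
print(torch.allclose(attn.sum(dim=1), torch.ones(seq_q, batch_size, heads)))  # True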
I wonder why the array shapes in mha.py are (C, B, D) rather than (B, C, D). I thought it was the convention that the batch is the first dimension. Specifically, here are the first few lines of the forward method of class MultiHeadAttention:

Thanks.
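For context, a small sketch of the two layout conventions. This is not the repository author's answer, only an illustration: stock PyTorch's nn.MultiheadAttention also defaults to the sequence-first layout (batch_first=False), i.e. inputs of shape [seq_len, batch_size, d_model], so the (C, B, D) ordering is a common convention rather than an error.

import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
batch_size, seq_len, d_model, heads = 4, 10, 32, 8

x_batch_first = torch.randn(batch_size, seq_len, d_model)  # (B, C, D)
x_seq_first = x_batch_first.transpose(0, 1)                # (C, B, D)

# batch_first defaults to False, so the module expects [seq_len, batch_size, d_model].
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=heads)
out, attn_weights = mha(x_seq_first, x_seq_first, x_seq_first)

print(out.shape)           # torch.Size([10, 4, 32]) -> [seq_len, batch_size, d_model]
print(attn_weights.shape)  # torch.Size([4, 10, 10]) -> [batch_size, seq_len, seq_len]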