Ya, it's just an implementation trick to make runtime faster.
What we want to do is:
attn = q*k + alibi_bias + mask
but to make it faster, basically what we do is:
mask = mask + alibi_bias
attn = q*k + mask
So both approaches are equivalent, but the second runs faster.
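To make the equivalence concrete, here's a minimal PyTorch sketch (the tensor names and the bias/mask construction are made up for illustration, not the repo's actual code):

```python
import torch

scores = torch.randn(1, 8, 16, 16)                         # q*k logits: (batch, heads, seq, seq)
alibi_bias = torch.randn(1, 8, 1, 16)                      # some broadcastable linear bias
mask = torch.triu(torch.full((16, 16), float("-inf")), 1)  # causal mask: -inf above the diagonal

attn_slow = scores + alibi_bias + mask  # add both terms on every call
fused = mask + alibi_bias               # fold the bias into the mask...
attn_fast = scores + fused              # ...then add a single tensor

assert torch.allclose(attn_slow, attn_fast)  # identical results either way
```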
Appreciate the clarification, thank you.
Oh, I didn't write this, but the reason it runs faster is that we run the
mask = mask + alibi_bias
line just once and reuse the result many times.
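In other words, something like this hypothetical precompute-once pattern (structure and names are illustrative only):

```python
import torch

heads, max_len = 8, 512

# Built a single time, e.g. at model init: causal mask with ALiBi folded in.
slopes = torch.tensor([2 ** (-8 * (i + 1) / heads) for i in range(heads)])
distances = torch.arange(max_len)[None, :] - torch.arange(max_len)[:, None]
alibi_bias = slopes[:, None, None] * distances                       # (heads, L, L)
mask = torch.triu(torch.full((max_len, max_len), float("-inf")), 1)
fused_mask = (mask + alibi_bias).unsqueeze(0)                        # (1, heads, L, L)

# Every subsequent forward pass is just one cheap add:
for _ in range(3):  # stands in for many training steps
    scores = torch.randn(2, heads, max_len, max_len)
    attn = scores + fused_mask
```

(The slope formula above is the paper's geometric sequence for a power-of-two head count; the paper handles other head counts slightly differently.)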
@ofirpress I have another similar question, but it's about an implementation detail.
Given my parameters, q.k^T should be (1, 8, 512, 512) for a batch size of 1, while alibi would be (16, 1, 512) (as per the calculations shown in the repository).
If I were to do attention_scores = attention_scores + alibi in this case, the shapes won't be broadcastable and will naturally result in an error. Am I missing anything here?
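For concreteness, the broadcast attempt with exactly those shapes fails like this (sketch only; `attention_scores` and `alibi` follow the naming in the question):

```python
import torch

attention_scores = torch.randn(1, 8, 512, 512)  # (batch, heads, seq, seq)
alibi = torch.randn(16, 1, 512)                 # the shape reported above

try:
    attention_scores + alibi
except RuntimeError as err:
    # Right-aligned broadcasting compares (1, 8, 512, 512) against (1, 16, 1, 512);
    # the 8-vs-16 mismatch in the second dimension is what raises the error.
    print(err)
```

One possible explanation (an assumption on my part, not confirmed here): fairseq's multi-head attention computes weights with shape (batch * heads, tgt_len, src_len), so the alibi tensor is built with a fused batch*heads leading dimension; a 4-D scores tensor would instead need the bias viewed as (1, heads, 1, seq) before adding.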
From the paper: attention is computed as softmax(q_i K^T + m · [-(i-1), ..., -2, -1, 0]), i.e. the linear biases are added to the query-key scores before the softmax.
But the README recommends multiplying the linear biases with the mask: https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py#L1022
What I'm struggling to understand is how that translates to multiplying the linear biases before softmaxing the output of q.k^T.
Thanks in advance.
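In case it helps, here's a quick numerical check (illustrative names, not the repo's code) that adding the biases via the mask still happens before the softmax, exactly as in the paper's formulation, since addition commutes:

```python
import torch

scores = torch.randn(1, 4, 6, 6)                           # q.k^T logits
alibi = torch.randn(1, 4, 1, 6)                            # a broadcastable linear bias
mask = torch.triu(torch.full((6, 6), float("-inf")), 1)    # causal mask

paper_form = torch.softmax(scores + alibi + mask, dim=-1)  # biases added pre-softmax
repo_form = torch.softmax(scores + (mask + alibi), dim=-1) # biases folded into the mask first

assert torch.allclose(paper_form, repo_form)
```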