Ya, it's just an implementation trick to make runtime faster.
What we want to do is:
attn = q*k + alibi_bias + mask
but to make it faster, basically what we do is:
mask = mask + alibi_bias
attn = q*k + mask
So both approaches are equivalent, but the second runs faster.
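To make the equivalence concrete, here's a minimal PyTorch sketch (the tensor names and the bias/mask construction are made up for illustration, not the repo's actual code):

```python
import torch

scores = torch.randn(1, 8, 16, 16)                         # q*k logits: (batch, heads, seq, seq)
alibi_bias = torch.randn(1, 8, 1, 16)                      # some broadcastable linear bias
mask = torch.triu(torch.full((16, 16), float("-inf")), 1)  # causal mask: -inf above the diagonal

attn_slow = scores + alibi_bias + mask  # add both terms on every call
fused = mask + alibi_bias               # fold the bias into the mask...
attn_fast = scores + fused              # ...then add a single tensor

assert torch.allclose(attn_slow, attn_fast)  # identical results either way
```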
Appreciate the clarification, thank you.
Oh, I didn't write this, but the reason it runs faster is that we run the
mask = mask + alibi_bias
line just once and reuse the result many times.
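In other words, something like this hypothetical precompute-once pattern (structure and names are illustrative only):

```python
import torch

heads, max_len = 8, 512

# Built a single time, e.g. at model init: causal mask with ALiBi folded in.
slopes = torch.tensor([2 ** (-8 * (i + 1) / heads) for i in range(heads)])
distances = torch.arange(max_len)[None, :] - torch.arange(max_len)[:, None]
alibi_bias = slopes[:, None, None] * distances                       # (heads, L, L)
mask = torch.triu(torch.full((max_len, max_len), float("-inf")), 1)
fused_mask = (mask + alibi_bias).unsqueeze(0)                        # (1, heads, L, L)

# Every subsequent forward pass is just one cheap add:
for _ in range(3):  # stands in for many training steps
    scores = torch.randn(2, heads, max_len, max_len)
    attn = scores + fused_mask
```

(The slope formula above is the paper's geometric sequence for a power-of-two head count; the paper handles other head counts slightly differently.)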
@ofirpress I have another similar question, but it's about an implementation detail.
Given my parameters, q.k^T should be (1, 8, 512, 512) for a batch size of 1, while alibi would be (16, 1, 512) (as per the calculations shown in the repository).
If I were to do attention_scores = attention_scores + alibi in this case, the shapes won't be broadcastable and will naturally result in an error. Am I missing anything here?
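For concreteness, the broadcast attempt with exactly those shapes fails like this (sketch only; `attention_scores` and `alibi` follow the naming in the question):

```python
import torch

attention_scores = torch.randn(1, 8, 512, 512)  # (batch, heads, seq, seq)
alibi = torch.randn(16, 1, 512)                 # the shape reported above

try:
    attention_scores + alibi
except RuntimeError as err:
    # Right-aligned broadcasting compares (1, 8, 512, 512) against (1, 16, 1, 512);
    # the 8-vs-16 mismatch in the second dimension is what raises the error.
    print(err)
```

One possible explanation (an assumption on my part, not confirmed here): fairseq's multi-head attention computes weights with shape (batch * heads, tgt_len, src_len), so the alibi tensor is built with a fused batch*heads leading dimension; a 4-D scores tensor would instead need the bias viewed as (1, heads, 1, seq) before adding.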
From the paper: attention is computed as softmax(q_i K^T + m · [-(i-1), ..., -2, -1, 0]), i.e. the linear biases are added to the query-key scores before the softmax.
But the README recommends multiplying the linear biases with the mask: https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py#L1022
What I'm struggling to understand is how that translates to multiplying the linear biases before softmaxing the output of q.k^T.
Thanks in advance.
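In case it helps, here's a quick numerical check (illustrative names, not the repo's code) that adding the biases via the mask still happens before the softmax, exactly as in the paper's formulation, since addition commutes:

```python
import torch

scores = torch.randn(1, 4, 6, 6)                           # q.k^T logits
alibi = torch.randn(1, 4, 1, 6)                            # a broadcastable linear bias
mask = torch.triu(torch.full((6, 6), float("-inf")), 1)    # causal mask

paper_form = torch.softmax(scores + alibi + mask, dim=-1)  # biases added pre-softmax
repo_form = torch.softmax(scores + (mask + alibi), dim=-1) # biases folded into the mask first

assert torch.allclose(paper_form, repo_form)
```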