Closed chijames closed 2 years ago
Yes, you're correct! We do this because softmax is invariant to adding a constant, so what's in the code is equivalent to what's in the paper. It's explained in slightly more detail here: https://nn.labml.ai/transformers/alibi/index.html
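To see the invariance concretely, here is a minimal sketch (plain Python, no fairseq code) comparing one row of the printed mask with its shifted "paper" version; the `softmax` helper and the `-0.0039` shift are illustrative assumptions, not the library's implementation:

```python
import math

def softmax(xs):
    # Subtract the row max for numerical stability -- this relies on the
    # same invariance: softmax(x + c) == softmax(x) for any constant c.
    finite = [x for x in xs if x != float("-inf")]
    m = max(finite)
    exps = [math.exp(x - m) if x != float("-inf") else 0.0 for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

NEG_INF = float("-inf")
# Second row of the printed (code) mask vs. the paper's symmetric version,
# which is the same row shifted by the constant -0.0039:
code_row = [0.0, 0.0039, NEG_INF, NEG_INF]
paper_row = [-0.0039, 0.0, NEG_INF, NEG_INF]

print(softmax(code_row))
print(softmax(paper_row))
```

Both rows produce the same attention weights, since they differ only by a per-row constant.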
Hi!
Shouldn't they be negative values? For example, the second row in the above example is [0, -0.0039, -inf, -inf]. Am I missing something?
That is because softmax is translation invariant.
Hi,
I really like the elegant idea of ALiBi, and thanks a lot for open-sourcing the codebase! I have a small question, however, about the actual numerical values of the ALiBi attn_mask applied to the causal self-attention matrix. In particular, I printed out the attn_mask at fairseq/modules/multihead_attention.py line 170 and found that it is not symmetric with respect to the diagonal, which seems to differ from the description of Figure 3 in the paper.
For example, I got something like: [[0, -inf, -inf, -inf], [0, 0.0039, -inf, -inf], [0, 0.0039, 0.0078, -inf], [0, 0.0039, 0.0078, 0.0117]] for one attention head.
Any help is greatly appreciated! Thanks!
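For reference, a short sketch of how a mask with that shape could be built. This is an illustrative assumption, not fairseq's actual code: it assumes a per-head slope of 2**-8 (about 0.0039, matching the printout above) and a bias of slope * j at each unmasked key position j:

```python
def alibi_causal_mask(seq_len, slope):
    """Build one head's causal ALiBi mask as a list of rows.

    Position (i, j) gets bias slope * j for j <= i, and -inf for
    future positions. Each row is then the paper's symmetric mask
    plus a per-row constant, which softmax ignores.
    """
    neg_inf = float("-inf")
    return [
        [slope * j if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Slope 2**-8 reproduces the ~0.0039 increments seen in the printout.
for row in alibi_causal_mask(4, 2 ** -8):
    print(["%.4f" % v for v in row])
```

Note how each row equals the paper's relative-distance penalties [-(i)*slope, ..., -slope, 0] shifted up by the constant i * slope, so the attention weights are unchanged.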