ofirpress / attention_with_linear_biases

Code for the ALiBi method for transformer language models (ICLR 2022)
MIT License

The numerical value of ALiBi attn_mask #9

Closed: chijames closed this issue 2 years ago

chijames commented 2 years ago

Hi,

I really like the elegant idea of ALiBi, and thanks a lot for open-sourcing the codebase! I have a small question, though, about the actual numerical values of the ALiBi attn_mask applied to the causal self-attention matrix. I printed the attn_mask at fairseq/modules/multihead_attention.py line 170 and found that it is not symmetric with respect to the diagonal, which seems to differ from the description of Figure 3 in the paper.

For example, for one attention head I got something like:

[[0, -inf,   -inf,   -inf  ],
 [0, 0.0039, -inf,   -inf  ],
 [0, 0.0039, 0.0078, -inf  ],
 [0, 0.0039, 0.0078, 0.0117]]
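For reference, a minimal standalone sketch (not the repo's exact fairseq code; it assumes a single head with slope 1/256, which matches the 0.0039 step, and a sequence length of 4) reproduces a mask of exactly this shape:

```python
import torch

# Minimal sketch, not the repo's exact fairseq code: build the ALiBi bias for a
# single head with slope m = 1/256 (the 0.0039 step above) over 4 positions and
# add the usual causal -inf mask.
seq_len = 4
m = 1.0 / 256  # head-specific slope; 1/256 ~= 0.0039

# As the printed values suggest, each row of the bias is [0, m, 2m, 3m, ...],
# i.e. it depends only on the key position, which is why it is not symmetric.
alibi = m * torch.arange(seq_len, dtype=torch.float32).unsqueeze(0).expand(seq_len, -1)

# Standard causal mask: -inf above the diagonal.
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

print(alibi + causal)
# tensor([[0.0000,   -inf,   -inf,   -inf],
#         [0.0000, 0.0039,   -inf,   -inf],
#         [0.0000, 0.0039, 0.0078,   -inf],
#         [0.0000, 0.0039, 0.0078, 0.0117]])
```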

Any help is greatly appreciated! Thanks!

ofirpress commented 2 years ago

Yes, you're correct! We do this because softmax is invariant to adding a constant, so what's in the code is equivalent to what's in the paper. It's explained in slightly more detail here: https://nn.labml.ai/transformers/alibi/index.html
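To make the equivalence concrete, here is a quick numeric check (a sketch, not code from this repo): for the last query position the paper's symmetric bias row would be [-3m, -2m, -m, 0], while the code's row is [0, m, 2m, 3m]. They differ only by the per-row constant 3m, so the softmax, and therefore the attention weights, come out identical.

```python
import torch

# The two bias rows differ by the constant 3m, so softmax(scores + paper_row)
# equals softmax(scores + code_row).
m = 1.0 / 256
scores = torch.randn(4)  # arbitrary pre-bias attention scores for one query

paper_row = torch.tensor([-3 * m, -2 * m, -m, 0.0])
code_row = torch.tensor([0.0, m, 2 * m, 3 * m])

print(torch.allclose(torch.softmax(scores + paper_row, dim=-1),
                     torch.softmax(scores + code_row, dim=-1)))  # True
```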

jeeyung commented 1 year ago

Hi!

Shouldn't they be negative values? For example, shouldn't the second row in the above example be [0, -0.0039, -inf, -inf]? Am I missing something?

idontkonwher commented 6 months ago

> Hi!
>
> Shouldn't they be negative values? For example, shouldn't the second row in the above example be [0, -0.0039, -inf, -inf]? Am I missing something?

That is because softmax is translation invariant: adding the same constant to every entry of a row does not change its softmax, so the code's rows give the same attention weights as the paper's.