kyegomez / AttentionIsOFFByOne

Implementation of "Attention Is Off By One" by Evan Miller
MIT License

IMPORTANT: The definition of softmax_one is wrong #5

Open PhilIp-L-Good opened 1 year ago

PhilIp-L-Good commented 1 year ago

The correct definition of softmax one would be:

$$\mathrm{softmax}_1(x)_i = \frac{e^{x_i - C}}{e^{-C} + \sum_j e^{x_j - C}}, \qquad C = \max_j x_j$$

Please refer to the following mathematical formula and reconsider.


PhilIp-L-Good commented 1 year ago

Consider the following mathematical derivation.

If we multiply both the numerator and denominator of the softmax function by the constant $e^{-C}$, we get the following equation:

$$\mathrm{softmax}_1(x)_i = \frac{e^{x_i}}{1 + \sum_j e^{x_j}} = \frac{e^{x_i - C}}{e^{-C} + \sum_j e^{x_j - C}}$$
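
For concreteness, here is a minimal NumPy sketch of this shifted formulation (the function name `softmax_one` and the use of NumPy are illustrative choices, not code from this repo):

```python
import numpy as np

def softmax_one(x, axis=-1):
    """softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)), computed with a shift.

    Subtracting C = max(x) from the logits turns the implicit 1 in the
    denominator into exp(-C), which is the point being made in this issue.
    """
    c = np.max(x, axis=axis, keepdims=True)   # C = max_j x_j
    e = np.exp(x - c)                          # exp(x_i - C), bounded by 1
    return e / (np.exp(-c) + e.sum(axis=axis, keepdims=True))
```

If the logits are all very negative, `exp(-C)` itself can overflow; clamping the shift at zero (`C = max(max(x), 0)`) keeps both terms bounded without changing the mathematical result.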

Devil-SX commented 1 year ago

I don't understand why you would do this. Could you explain further how this mathematical formula leads to changing the 1 to $e^{-C}$?

martindbp commented 10 months ago

@Devil-SX The reason is that in the original formulation, when you subtract the max but keep the literal 1, the attention sink can never receive more than 0.5 attention (the shifted max element contributes $e^0 = 1$ to the denominator, so the sink's share is at most $\frac{1}{1+1}$). But if you replace the 1 with $e^c$ (with $c = -\max_j x_j$) you get:

$\frac{e^{x_i+c}}{e^c + \sum_j e^{x_j+c}} = \frac{e^c e^{x_i}}{e^c + \sum_j e^{x_j}e^c} = \frac{e^c e^{x_i}}{e^c(1 + \sum_j e^{x_j})} = \frac{e^{x_i}}{1 + \sum_j e^{x_j}}$
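
For reference, a quick numerical check of this identity and of the 0.5 cap described above, as a NumPy sketch (the example logits are arbitrary):

```python
import numpy as np

x = np.array([-8.0, -9.0, -10.0])   # very negative logits: the sink should dominate
c = -x.max()                         # shift constant, here c = -max_j x_j

unshifted = np.exp(x) / (1 + np.exp(x).sum())                    # original definition
shifted   = np.exp(x + c) / (np.exp(c) + np.exp(x + c).sum())    # 1 replaced by e^c
naive     = np.exp(x + c) / (1 + np.exp(x + c).sum())            # 1 kept after the shift

print(np.allclose(shifted, unshifted))   # True: the identity above holds
print(1 - unshifted.sum())               # ~0.9995: sink takes nearly all the attention
print(1 - naive.sum())                   # ~0.40, and never above 0.5
```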

christopher-w-murphy commented 7 months ago

I agree. The definition of softmax_1 is wrong here. I worked with Evan Miller on this, and saw many people make the same mistake. I implemented the correct version in Flash Attention here: https://github.com/softmax1/Flash-Attention-Softmax-N