Open PhilIp-L-Good opened 1 year ago
Please reconsider the definition in light of the following formula.
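If we multiply both the numerator and the denominator of the softmax_1 function by the constant $e^{-C}$, we get the following equation:
$\frac{e^{x_i}}{1 + \sum_j e^{x_j}} = \frac{e^{-C} e^{x_i}}{e^{-C}(1 + \sum_j e^{x_j})} = \frac{e^{x_i - C}}{e^{-C} + \sum_j e^{x_j - C}}$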
I don't understand why you would do this. Could you explain further how this formula leads to changing the 1 to $e^{-C}$?
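@Devil-SX The reason is that in the original formulation, when you subtract the max but leave the 1 unchanged, it cannot give more than 0.5 of the attention to the attention sink: the largest shifted term is $e^0 = 1$, so the denominator is at least 2. But if you replace the 1 with $e^c$, you get: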
$\frac{e^{x_i+c}}{e^c + \sum_j e^{x_j+c}} = \frac{e^c e^{x_i}}{e^c + \sum_j e^{x_j}e^c} = \frac{e^c e^{x_i}}{e^c(1 + \sum_j e^{x_j})} = \frac{e^{x_i}}{1 + \sum_j e^{x_j}}$
I agree. The definition of softmax_1 is wrong here. I worked with Evan Miller on this, and saw many people make the same mistake. I implemented the correct version in Flash Attention here: https://github.com/softmax1/Flash-Attention-Softmax-N
The correct definition of softmax one would be:
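As a rough sketch of what this looks like in code (not the Flash Attention implementation linked above), the snippet below compares the two variants in plain Python. The function names `softmax_one` and `naive_shifted` are made up for illustration, and shifting by `max(0, max(xs))` rather than `max(xs)` is my own assumption to keep `exp(-m)` from overflowing when all scores are very negative; by the derivation above, any shift gives the same result as long as the 1 is rescaled by the same factor.

```python
import math


def softmax_one(xs):
    """softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)), with a stabilizing shift."""
    # Assumption: treat the sink as an implicit logit of 0 and include it in the max.
    m = max(0.0, max(xs))
    sink = math.exp(-m)  # the "1" in the denominator, rescaled by exp(-m)
    exps = [math.exp(x - m) for x in xs]
    denom = sink + sum(exps)
    return [e / denom for e in exps]


def naive_shifted(xs):
    """The mistaken variant: subtracts the max but leaves the 1 unscaled."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    denom = 1.0 + sum(exps)
    return [e / denom for e in exps]


# All scores are strongly negative, so most of the weight should go to the sink.
scores = [-5.0, -6.0, -7.0]
sink_correct = 1.0 - sum(softmax_one(scores))
sink_naive = 1.0 - sum(naive_shifted(scores))
print("sink weight, rescaled 1:", sink_correct)
print("sink weight, unscaled 1:", sink_naive)
```

With the rescaled 1, the sink weight can approach 1 when every score is very negative; with the unscaled 1, it stays at or below 0.5 regardless of the scores.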