Kahsolt opened 1 year ago
We might extend this `softmax1` to a general `softmax+c`, formulated as:

$$\mathrm{softmax}_c(x)_i = \frac{e^{x_i}}{c + \sum_j e^{x_j}}$$

This will happen by chance if one forgets to mask the trailing zero-padding items (c entries in total) in the QK matrix (i.e. the attn scores): each unmasked zero contributes $e^0 = 1$ to the softmax denominator.
How these zero paddings would affect transformers still needs ablation experiments, but it should not really be a new thing.
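A minimal numpy sketch of this effect, assuming a toy score vector (the helper names here are illustrative, not from any particular codebase):

```python
import numpy as np

def softmax(x):
    # naive softmax (no max-shift; fine for small toy inputs)
    e = np.exp(x)
    return e / e.sum()

def softmax_c(x, c):
    # softmax with an extra constant c in the denominator,
    # generalizing softmax1 (which is the c = 1 case)
    e = np.exp(x)
    return e / (c + e.sum())

scores = np.array([2.0, 1.0, 0.5])   # toy attention scores for one query
c = 4                                 # number of unmasked zero paddings

# appending c unmasked zero-padding scores and taking the naive softmax...
padded = softmax(np.concatenate([scores, np.zeros(c)]))

# ...reproduces softmax_c on the real entries, since each zero
# contributes e^0 = 1 to the denominator
assert np.allclose(padded[:len(scores)], softmax_c(scores, c))
```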
Just checked the impl of `QuietAttention`. According to the blog, prepending a `zero_vector` as a new entry should be equivalent in effect to replacing the naive softmax with `softmax1` — one should do one or the other, but not both. Still thinking over the `+1 softmax` part...
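A minimal numpy sketch of that equivalence, and of why doing both over-counts (toy scores; helper names are mine, not from the repo):

```python
import numpy as np

def softmax(x):
    e = np.exp(x)
    return e / e.sum()

def softmax1(x):
    # softmax1 from the blog: a constant 1 added to the denominator
    e = np.exp(x)
    return e / (1.0 + e.sum())

scores = np.array([2.0, 1.0, 0.5])   # toy attention scores

# prepend a zero score (as QuietAttention's prepended zero entry
# would produce) and apply the naive softmax
prepended = softmax(np.concatenate([[0.0], scores]))

# on the real entries this matches softmax1(scores); the leftover
# probability mass sits on the dummy zero entry
assert np.allclose(prepended[1:], softmax1(scores))

# doing BOTH (prepending a zero AND using softmax1) over-counts:
# the denominator gains 1 + e^0 = 2 extra, i.e. softmax+2, not softmax1
both = softmax1(np.concatenate([[0.0], scores]))
assert not np.allclose(both[1:], softmax1(scores))
```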