kyegomez / AttentionIsOFFByOne

Implementation of "Attention Is Off By One" by Evan Miller
MIT License

Discuss: do `softmax_one` and `zero_vector` in QuietAttention conflict? #1

Open · Kahsolt opened 1 year ago

Kahsolt commented 1 year ago

Just checked the implementation of QuietAttention. According to the blog, prepending a zero_vector as a new entry should be equivalent in effect to replacing the naive softmax with softmax1, so the implementation should apply one of the two fixes, not both.
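A quick numerical check of that equivalence (a sketch, not the repo's code; this `softmax_one` is my own stable re-implementation of the blog's formula): the weights from `softmax_one` match an ordinary softmax over the scores with a single zero logit prepended, so stacking both tricks would add the "+1" to the denominator twice.

```python
import torch

def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)), computed stably."""
    # include the implicit zero logit when taking the max, so the shift stays exact
    m = x.max(dim=dim, keepdim=True).values.clamp(min=0.0)
    exp_x = torch.exp(x - m)
    return exp_x / (torch.exp(-m) + exp_x.sum(dim=dim, keepdim=True))

scores = torch.randn(4, 8)                     # fake attention scores, one row per query
zero = torch.zeros(scores.size(0), 1)          # the prepended zero_vector entry
via_zero_entry = torch.softmax(torch.cat([zero, scores], dim=-1), dim=-1)[:, 1:]

print(torch.allclose(softmax_one(scores), via_zero_entry, atol=1e-6))  # True
```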

still thinking over the +1 softmax part...


Kahsolt commented 1 year ago

We might extend this softmax1 to a more general softmax+c, formulated as follows.
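A plausible formulation (my notation, following the blog's definition of softmax1, not quoted from the repo):

$$
\operatorname{softmax}_c(x)_i = \frac{e^{x_i}}{c + \sum_j e^{x_j}}
$$

For c = 0 this recovers the ordinary softmax, and for c = 1 it recovers softmax1.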

This happens by chance if one forgets to mask the trailing zero-padding entries (c of them) in the QK matrix (i.e. the attention scores). How these zero paddings affect transformers still needs ablation experiments, but it should not be a genuinely new thing.
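A small sketch of that scenario (illustrative only; `softmax_c` and the shapes are my own, not from the repo): leaving c zero-score padding columns unmasked adds exp(0) = 1 to the softmax denominator c times, so the real tokens end up with exactly the softmax+c weights, while the padding positions also soak up some attention mass, which is the part an ablation would need to probe.

```python
import torch

def softmax_c(x: torch.Tensor, c: float, dim: int = -1) -> torch.Tensor:
    """softmax_c(x)_i = exp(x_i) / (c + sum_j exp(x_j)) (naive version, for illustration)."""
    exp_x = torch.exp(x)
    return exp_x / (c + exp_x.sum(dim=dim, keepdim=True))

c = 3
scores = torch.randn(4, 8)                                # scores for the real tokens
padded = torch.cat([scores, torch.zeros(4, c)], dim=-1)   # c unmasked zero-padding columns
weights_on_real = torch.softmax(padded, dim=-1)[:, :scores.size(-1)]

print(torch.allclose(weights_on_real, softmax_c(scores, c), atol=1e-6))  # True
```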