Kahsolt opened 1 year ago
We might extend this `softmax1` to a general `softmax+c`, formulated as:

$$\mathrm{softmax}_c(x)_i = \frac{e^{x_i}}{c + \sum_j e^{x_j}}$$

This will happen by chance if one forgets to mask the trailing zero-padding items (c entries in total) in the QK matrix (i.e. the attn scores): each unmasked zero contributes $e^0 = 1$ to the softmax denominator.
How these zero paddings would affect transformers still needs ablation experiments, but it should not really be a new thing.
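A minimal numpy sketch of this effect, assuming a toy score vector (the helper names here are illustrative, not from any particular codebase):

```python
import numpy as np

def softmax(x):
    # naive softmax (no max-shift; fine for small toy inputs)
    e = np.exp(x)
    return e / e.sum()

def softmax_c(x, c):
    # softmax with an extra constant c in the denominator,
    # generalizing softmax1 (which is the c = 1 case)
    e = np.exp(x)
    return e / (c + e.sum())

scores = np.array([2.0, 1.0, 0.5])   # toy attention scores for one query
c = 4                                 # number of unmasked zero paddings

# appending c unmasked zero-padding scores and taking the naive softmax...
padded = softmax(np.concatenate([scores, np.zeros(c)]))

# ...reproduces softmax_c on the real entries, since each zero
# contributes e^0 = 1 to the denominator
assert np.allclose(padded[:len(scores)], softmax_c(scores, c))
```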
Just checked the impl of `QuietAttention`. According to the blog, prepending a `zero_vector` as a new entry should be equivalent in effect to replacing the naive softmax with `softmax1` — one should do one or the other, but not both. Still thinking over the `+1 softmax` part...
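A minimal numpy sketch of that equivalence, and of why doing both over-counts (toy scores; helper names are mine, not from the repo):

```python
import numpy as np

def softmax(x):
    e = np.exp(x)
    return e / e.sum()

def softmax1(x):
    # softmax1 from the blog: a constant 1 added to the denominator
    e = np.exp(x)
    return e / (1.0 + e.sum())

scores = np.array([2.0, 1.0, 0.5])   # toy attention scores

# prepend a zero score (as QuietAttention's prepended zero entry
# would produce) and apply the naive softmax
prepended = softmax(np.concatenate([[0.0], scores]))

# on the real entries this matches softmax1(scores); the leftover
# probability mass sits on the dummy zero entry
assert np.allclose(prepended[1:], softmax1(scores))

# doing BOTH (prepending a zero AND using softmax1) over-counts:
# the denominator gains 1 + e^0 = 2 extra, i.e. softmax+2, not softmax1
both = softmax1(np.concatenate([[0.0], scores]))
assert not np.allclose(both[1:], softmax1(scores))
```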