Closed by pfeatherstone 11 months ago
yea, the attention head scaling actually came from the normformer paper, and is applied to each output head of attention, before the linear combination of the merged heads
i actually saw instabilities when i last tried it, and nobody else i know is using it. perhaps i should remove it. these days i favor projecting the original input to those heads and gating the output that way
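For reference, a minimal sketch of what per-head output scaling looks like in the spirit of NormFormer — not the actual x-transformers source, and the module/parameter names here are made up: one learned scalar per head, applied to each head's output before the heads are merged and passed through the output projection.

```python
import torch
import torch.nn as nn

class HeadScaledAttention(nn.Module):
    """Hypothetical illustration of per-head output scaling (NormFormer-style)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.dim_head = dim // heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        # one learned scale per head -- the idea behind attn_head_scale
        self.head_scale = nn.Parameter(torch.ones(heads, 1, 1))
        self.to_out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, n, d = x.shape
        q, k, v = (t.reshape(b, n, self.heads, self.dim_head).transpose(1, 2)
                   for t in self.to_qkv(x).chunk(3, dim=-1))
        attn = (q @ k.transpose(-2, -1)) * self.dim_head ** -0.5
        attn = attn.softmax(dim=-1)
        out = attn @ v                      # (b, heads, n, dim_head)
        out = out * self.head_scale         # scale each head's output before merging
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)             # linear combination of the merged heads
```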
Am I right in thinking that using `use_scalenorm == True` together with `attn_head_scale == True` is pointless, since `ScaleNorm` will undo a learned scalar multiplicative parameter like what `attn_head_scale` does?
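To illustrate the scale-invariance this question relies on, here is a minimal sketch of a ScaleNorm-style layer, assuming the usual form `g * x / ||x||` (this is not the library's exact implementation): any uniform multiplicative factor applied to the input is normalized away, leaving only the learned scalar `g`.

```python
import torch
import torch.nn as nn

class ScaleNorm(nn.Module):
    """Sketch of ScaleNorm: rescale to unit L2 norm, then multiply by a learned scalar g."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.ones(1) * dim ** 0.5)
        self.eps = eps

    def forward(self, x):
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return x / norm * self.g

x = torch.randn(2, 4, 64)
sn = ScaleNorm(64)
# multiplying the whole input by a scalar does not change ScaleNorm's output
print(torch.allclose(sn(x), sn(3.0 * x), atol=1e-6))  # True
```

Note that `attn_head_scale` is a per-head scalar applied before the output projection rather than a single scalar on the residual stream, so whether a following norm fully cancels it is exactly the question here.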