Closed by pfeatherstone 11 months ago
yea, the attention head scaling actually came from the normformer paper, and is applied to each output head of attention, before the linear combination of the merged heads
i actually saw instabilities when i last tried it, and nobody else i know is using it. perhaps i should remove it. these days i favor projecting the original input to those heads and gating the output that way
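For reference, a minimal sketch of what per-head output scaling looks like in the spirit of NormFormer — not the actual x-transformers source, and the module/parameter names here are made up: one learned scalar per head, applied to each head's output before the heads are merged and passed through the output projection.

```python
import torch
import torch.nn as nn

class HeadScaledAttention(nn.Module):
    """Hypothetical illustration of per-head output scaling (NormFormer-style)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.dim_head = dim // heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        # one learned scale per head -- the idea behind attn_head_scale
        self.head_scale = nn.Parameter(torch.ones(heads, 1, 1))
        self.to_out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, n, d = x.shape
        q, k, v = (t.reshape(b, n, self.heads, self.dim_head).transpose(1, 2)
                   for t in self.to_qkv(x).chunk(3, dim=-1))
        attn = (q @ k.transpose(-2, -1)) * self.dim_head ** -0.5
        attn = attn.softmax(dim=-1)
        out = attn @ v                      # (b, heads, n, dim_head)
        out = out * self.head_scale         # scale each head's output before merging
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)             # linear combination of the merged heads
```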
Am I right in thinking that using `use_scalenorm == True` together with `attn_head_scale == True` is pointless, since `ScaleNorm` will undo a learned scalar multiplicative parameter like what `attn_head_scale` does?
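To illustrate the scale-invariance this question relies on, here is a minimal sketch of a ScaleNorm-style layer, assuming the usual form `g * x / ||x||` (this is not the library's exact implementation): any uniform multiplicative factor applied to the input is normalized away, leaving only the learned scalar `g`.

```python
import torch
import torch.nn as nn

class ScaleNorm(nn.Module):
    """Sketch of ScaleNorm: rescale to unit L2 norm, then multiply by a learned scalar g."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.ones(1) * dim ** 0.5)
        self.eps = eps

    def forward(self, x):
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return x / norm * self.g

x = torch.randn(2, 4, 64)
sn = ScaleNorm(64)
# multiplying the whole input by a scalar does not change ScaleNorm's output
print(torch.allclose(sn(x), sn(3.0 * x), atol=1e-6))  # True
```

Note that `attn_head_scale` is a per-head scalar applied before the output projection rather than a single scalar on the residual stream, so whether a following norm fully cancels it is exactly the question here.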