I could not find a rigorous definition of the feature norms in the paper. Which layer or block do the tokens originate from?
Regarding the attention maps, I assume the norms are computed from the linearly transformed tokens used to calculate the attention matrices. After LayerNorm, every token should have a norm of roughly $\sqrt{d}$ (at least before the learnable affine parameters are applied). However, Fig. 3 shows tokens with norms ranging from 200 to 600, which seems far too large to be $\sqrt{d}$. This confuses me.
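For concreteness, here is a minimal sketch of the behaviour I have in mind (plain `torch.nn.LayerNorm` without the learnable affine parameters, and an arbitrary dimension $d = 768$ as in ViT-B; both choices are my assumptions, not something stated in the paper):

```python
import torch

d = 768  # assumed embedding dimension (ViT-B); not specified in the paper
ln = torch.nn.LayerNorm(d, elementwise_affine=False)

tokens = torch.randn(16, d) * 50.0   # arbitrary pre-norm token features
normed = ln(tokens)

# LayerNorm gives each token zero mean and unit variance across its d features,
# so every normalized token has L2 norm ~sqrt(d) ≈ 27.7 for d = 768,
# nowhere near the 200-600 range reported in Fig. 3.
print(normed.norm(dim=-1))
```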
Am I misunderstanding something?