kyegomez / Vit-RGTS

Open source implementation of "Vision Transformers Need Registers"

Confused about the L2 Norm #22

Open LUPIN11 opened 2 months ago

LUPIN11 commented 2 months ago

I could not find a rigorous definition of the feature norms in the paper. Which layer or block do the tokens come from? For the attention maps, I assume the norms are computed on the linearly transformed tokens used to build the attention matrices. Since LayerNorm normalizes each token, all tokens should have a norm of roughly $\sqrt{d}$. However, Fig. 3 shows tokens with norms ranging from 200 to 600, which seems far too large for $\sqrt{d}$. Am I misunderstanding something?
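
To make the arithmetic concrete, here is a minimal PyTorch sketch (the dimension is illustrative, not taken from the paper) showing that a token passing through LayerNorm with default affine parameters ends up with an L2 norm of almost exactly $\sqrt{d}$, regardless of the input scale:

```python
import torch
import torch.nn as nn

d = 768                                 # hypothetical embedding dim, not from the paper
tokens = torch.randn(16, d) * 50.0      # pre-norm features at an arbitrary large scale
ln = nn.LayerNorm(d)                    # default affine params: weight = 1, bias = 0

norms_before = tokens.norm(dim=-1)      # depends entirely on the input scale
norms_after = ln(tokens).norm(dim=-1)   # ~ sqrt(d) for every token

print(norms_before.mean())              # large, arbitrary
print(norms_after.mean())               # ~= 27.7
print(d ** 0.5)                         # 27.7128...
```

So if the norms in Fig. 3 were measured directly on post-LayerNorm outputs (before any learned rescaling), values of 200 to 600 should not be possible, which is why a precise definition of where the tokens are taken from matters.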

Upvote & Fund

Fund with Polar

github-actions[bot] commented 2 days ago

Stale issue message