Open George614 opened 1 month ago

Hi Umar,

I absolutely love your YT video explaining the PaliGemma model, and thanks for all the good work! I found this line, which seems to contradict what you said in the video (namely, that the scaling controls / reduces the variance so that it does not grow as the text / image embedding dimension grows). Is this a bug or an intentional scaling of the hidden states?

Best, George

I think this comes from HF's Gemma implementation, but it is never mentioned in the Gemma/Gemma2 technical reports, so I guess it is some magic lol.

@George614 @KevinHooah probably for similar reasons to why it's done in the attention mechanism: https://sifal.social/posts/Attention-scores,-Scaling-and-Softmax/
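For context, the behavior under discussion can be sketched like this. This is a hedged reconstruction assuming the HF-style Gemma forward pass; the sizes, `embed_tokens`, and the variable names here are illustrative, not the exact code from the repo:

```python
import math
import torch

# A minimal sketch of the scaling being discussed (toy sizes, assumed
# names, not the actual HF implementation).
hidden_size = 2048
vocab_size = 1000  # toy vocabulary for the sketch
embed_tokens = torch.nn.Embedding(vocab_size, hidden_size)

input_ids = torch.tensor([[1, 5, 7]])
hidden_states = embed_tokens(input_ids)

# The line in question: the embedding output is multiplied by
# sqrt(hidden_size) before entering the first decoder layer.
normalizer = math.sqrt(hidden_size)
hidden_states = hidden_states * normalizer
```

Note the multiplication by sqrt(hidden_size) is the same kind of sqrt(d) factor as the 1/sqrt(d_k) scaling inside scaled dot-product attention, which is what the linked post discusses.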