hkproj / pytorch-paligemma

Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation: https://www.youtube.com/watch?v=vAmKB7iPkWw

Multiplying hidden_states by normalizer vs. dividing by it #3

Open George614 opened 1 month ago

George614 commented 1 month ago

Hi Umar,

I absolutely love your YT video explaining the PaliGemma model, and thanks for all the good work! I found this line, which seems to contradict what you said in the video (namely that the scaling is meant to control / reduce the variance so that it does not grow as the text / image embedding dimensions grow). Is this a bug or an intentional scaling of the hidden states?

Best, George
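
For context, the line being asked about likely resembles the embedding scaling in HF's Gemma implementation, where the embeddings are multiplied (not divided) by sqrt(hidden_size). A minimal sketch of that behavior, assuming a hypothetical `hidden_size` of 2048 for illustration:

```python
import torch

hidden_size = 2048  # assumed embedding dimension, for illustration only
embeddings = torch.randn(1, 10, hidden_size)  # roughly unit-variance token embeddings

# The scaling in question: multiply by sqrt(hidden_size)
normalizer = torch.tensor(hidden_size**0.5, dtype=embeddings.dtype)
hidden_states = embeddings * normalizer

# Multiplying grows the per-element std by sqrt(hidden_size); it does not shrink it,
# which is why it looks contradictory to a "reduce the variance" explanation.
print(embeddings.std().item())
print(hidden_states.std().item())
```

So the confusion is understandable: this multiplication increases the scale of the hidden states rather than normalizing it down.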

KevinHooah commented 3 weeks ago

I think this comes from HF's Gemma implementation, but it is never mentioned in the Gemma/Gemma2 technical reports, so I guess it is some magic lol.

MostHumble commented 2 weeks ago

@George614 @KevinHooah probably for reasons similar to why it's done in the attention mechanism: https://sifal.social/posts/Attention-scores,-Scaling-and-Softmax/
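
The attention-side motivation referenced above is the standard one: raw query-key dot products have variance proportional to the head dimension, so dividing by sqrt(d_k) keeps the softmax inputs at roughly unit scale. A quick sketch, assuming a hypothetical head dimension `d_k = 64`:

```python
import torch

d_k = 64  # assumed head dimension, for illustration only
q = torch.randn(100, d_k)  # unit-variance queries
k = torch.randn(100, d_k)  # unit-variance keys

scores = q @ k.T          # dot-product variance grows roughly as d_k
scaled = scores / d_k**0.5  # dividing by sqrt(d_k) restores roughly unit variance

# Without scaling, large-magnitude scores push softmax into near-one-hot
# regions with tiny gradients; scaling avoids that.
print(scores.std().item())
print(scaled.std().item())
```

Whether the embedding-side multiplication in Gemma is motivated by the same variance argument is, as noted above, not spelled out in the technical reports.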