openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
MIT License

about position embedding scale #423

Open OliverHuang1220 opened 7 months ago

OliverHuang1220 commented 7 months ago

Thanks for the great work. In the vision transformer, the positional embedding initialization is multiplied by a scaling factor, which is not done in the original ViT. The paper also mentions that CLIP "use[s] a slightly different initialization scheme". How should this operation be explained?
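
For context, the initialization being asked about follows roughly this pattern (a paraphrased sketch of CLIP's `VisionTransformer` setup in `model.py`; the exact variable names and surrounding code may differ):

```python
import torch
import torch.nn as nn

# Example hyperparameters (ViT-B/32-style values, assumed for illustration)
width = 768              # transformer embedding dimension
patch_size = 32
input_resolution = 224

# CLIP scales the random initialization by width ** -0.5, rather than using
# the small fixed-std (e.g. 0.02) truncated-normal init of the original ViT.
scale = width ** -0.5
class_embedding = nn.Parameter(scale * torch.randn(width))
positional_embedding = nn.Parameter(
    scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width)
)
```

The question is how to justify multiplying the initial embeddings by `width ** -0.5` instead of using the original ViT initialization.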