microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

[unimim] Mismatched positional embedding for ViT-Large/14 at input resolution 196 #921

Open futureisatyourhand opened 1 year ago

futureisatyourhand commented 1 year ago

Hello. Regarding the CLIP knowledge distillation paper "A Unified View of Masked Image Modeling": when the teacher is CLIP ViT-Large/14 at an input resolution of 196 and the student is ViT-Base/16 at an input resolution of 224, CLIP ViT-Large/14's pretrained positional embedding has 257 positions (16x16 patches + 1 class token, from pretraining at 224), but the teacher at resolution 196 only needs 197 positions (14x14 patches + 1 class token). How should I fix this mismatch to align with the paper?

Thanks very much!

pengzhiliang commented 1 year ago

Take it easy.

Please refer to clip_model in the BEiT v2 codebase for details. Specifically, the interpolate_pos_encoding function is employed to resize the positional embedding to match the input resolution.
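For readers who don't want to dig through the codebase: below is a minimal sketch of this kind of positional-embedding resizing via bicubic interpolation, not the exact BEiT v2 implementation. The function name resize_pos_embed and the random tensor are illustrative; the sizes match the case in the question (257 pretrained positions resized to 197).

```python
# Sketch: resize a ViT positional embedding by bicubic interpolation
# over the patch grid, keeping the class-token embedding unchanged.
# Size arithmetic for the CLIP ViT-L/14 case discussed above:
#   pretrained at 224: (224/14)^2 + 1 cls = 16*16 + 1 = 257 positions
#   needed at 196:     (196/14)^2 + 1 cls = 14*14 + 1 = 197 positions
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """pos_embed: (1, 1 + old_grid**2, dim) -> (1, 1 + new_grid**2, dim)."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pos.shape[1] ** 0.5)
    dim = patch_pos.shape[-1]
    # (1, N, D) -> (1, D, g, g) so 2D spatial interpolation applies
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    # back to (1, new_grid**2, D) and re-attach the class-token position
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

pos = torch.randn(1, 257, 1024)             # stand-in for CLIP ViT-L/14 pos embed
resized = resize_pos_embed(pos, 196 // 14)  # 197 positions for a 196 input
print(resized.shape)                        # torch.Size([1, 197, 1024])
```

With this, the teacher's pretrained weights can be loaded unchanged and only the positional embedding is adapted to the 196 input resolution.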