Hello, regarding the CLIP knowledge distillation paper, i.e., A Unified View of Masked Image Modeling:
when the teacher is CLIP ViT-Large/14 at an input resolution of 196 and the student is ViT-Base/16 at an input resolution of 224, the pretrained CLIP ViT-Large/14 positional embedding (257 entries) does not match the positional embedding our teacher needs (197 entries). How should I fix this to align with the paper?
Please refer to clip_model in the BEiT v2 codebase for details.
Specifically, the interpolate_pos_encoding function is employed to resize the positional embedding.
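For illustration, here is a minimal sketch of how such a resize is commonly done in PyTorch. It is not the BEiT v2 function itself: the helper name `resize_pos_embed`, the `(1, 1 + N, D)` tensor layout, the embedding width of 1024, and the choice of bicubic interpolation are all assumptions made for this example. The idea is to keep the class-token position as-is, reshape the 256 patch positions into a 16x16 grid, and interpolate that grid down to 14x14 (196 / 14 = 14), giving 1 + 196 = 197 entries.

```python
import math

import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize a (1, 1 + N, D) positional embedding to (1, 1 + new_grid**2, D).

    Hypothetical helper for illustration; assumes the first entry belongs to
    the class token and the remaining N entries form a square patch grid.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    num_patches, dim = patch_pos.shape[1], patch_pos.shape[2]
    old_grid = math.isqrt(num_patches)
    if old_grid * old_grid != num_patches:
        raise ValueError("patch positions do not form a square grid")

    # (1, N, D) -> (1, D, old_grid, old_grid) so the grid can be treated as an image
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(
        patch_pos, size=(new_grid, new_grid), mode="bicubic", align_corners=False
    )
    # (1, D, new_grid, new_grid) -> (1, new_grid**2, D)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

    # The class-token position is copied over unchanged
    return torch.cat([cls_pos, patch_pos], dim=1)


# Stand-in for the pretrained ViT-Large/14 embedding: 1 + 16 * 16 = 257 entries
pos_embed = torch.randn(1, 257, 1024)
resized = resize_pos_embed(pos_embed, new_grid=196 // 14)
print(resized.shape)
```

If the checkpoint stores the embedding without a leading batch dimension, add one with `unsqueeze(0)` before calling the helper and remove it afterwards.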
Thanks very much!