microsoft / XPretrain

Multi-modality pre-training

Question regarding video proxy mechanism in CLIP-ViP #16

Closed fadzaka12 closed 1 year ago

fadzaka12 commented 1 year ago

Congratulations on your paper's acceptance at ICLR 2023! Your work is insightful and achieves a significant performance improvement on the video-text retrieval task.

I want to ask about the implementation of the video proxy mechanism. In the paper, you mention that it is simply a learnable parameter of length M. However, when I look at the code, you define two separate parameters for the video proxy tokens: class_embedding and added_cls. class_embedding is a 1D vector of size hidden_size, while added_cls is a 2D matrix of shape add_cls_num x hidden_size. CMIIW, but I cannot find any reference to this split in the main paper.

I have checked the sample configuration for each dataset, and it turns out you set add_cls_num to 3. Does this correspond to the 4 Video Proxy Tokens mentioned in the paper, i.e., one token from class_embedding plus add_cls_num tokens from added_cls, with add_cls_num = 3? Can you explain the intuition behind separating this into class_embedding and added_cls? A sketch of my current understanding follows below.
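For concreteness, here is a minimal sketch of how I read the two parameters and how they would be prepended to the patch sequence (the shapes and the forward pass are my assumptions based on the description above, not the repository's exact code):

```python
import torch
import torch.nn as nn

class VideoProxySketch(nn.Module):
    """Hypothetical sketch of the two video-proxy parameters discussed above."""

    def __init__(self, hidden_size: int, add_cls_num: int = 3):
        super().__init__()
        # 1D vector of size hidden_size (could be initialized from CLIP's [CLS])
        self.class_embedding = nn.Parameter(torch.randn(hidden_size))
        # 2D matrix of extra proxy tokens: add_cls_num x hidden_size
        self.added_cls = nn.Parameter(torch.randn(add_cls_num, hidden_size))

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, hidden_size)
        batch = patch_embeds.shape[0]
        # 1 + add_cls_num proxy tokens (4 in total when add_cls_num = 3)
        proxies = torch.cat([self.class_embedding.unsqueeze(0), self.added_cls], dim=0)
        proxies = proxies.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the proxy tokens to the flattened patch sequence
        return torch.cat([proxies, patch_embeds], dim=1)
```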

Thank you

HellwayXue commented 1 year ago

Hi, to maximize the use of CLIP's initialization, we load the weights of CLIP's [CLS] token into the first video proxy token. Keeping the two parameters separate in the implementation makes this easy to do.
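For illustration, a minimal sketch of what that initialization could look like (the function and the state-dict key name are assumptions for this example, not the repository's actual code):

```python
import torch

def init_video_proxy_from_clip(model, clip_state_dict):
    """Copy CLIP's visual [CLS] embedding into the first video proxy token.

    The remaining proxy tokens (added_cls) keep their fresh initialization.
    """
    # "visual.class_embedding" is the usual key in OpenAI CLIP checkpoints;
    # treat it as an assumption here.
    cls_weight = clip_state_dict["visual.class_embedding"]
    with torch.no_grad():
        model.class_embedding.copy_(cls_weight)
    # model.added_cls is left as newly initialized parameters
```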

fadzaka12 commented 1 year ago

Do we also need to set the number of video proxy tokens for the downstream tasks? From the paper, it seems this is only needed for the post-pretraining stage. However, in the configuration files, you set the number of video proxy tokens to 4 for each downstream dataset.

HellwayXue commented 1 year ago

No, the number of video proxy tokens is already fixed after the post-pretraining stage.
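The downstream configs simply repeat the same value so the post-pretrained proxy-token weights load with matching shapes; roughly (checkpoint path and key name are illustrative, not the repository's actual names):

```python
import torch

# Hypothetical sanity check: the downstream add_cls_num should match the
# post-pretrained checkpoint so the proxy-token weights load cleanly.
ckpt = torch.load("post_pretrained.pt", map_location="cpu")  # illustrative path
add_cls_num_in_ckpt = ckpt["added_cls"].shape[0]             # assumed key name
assert add_cls_num_in_ckpt == 3, "downstream add_cls_num must equal the checkpoint value"
# Total proxy tokens = 1 (class_embedding) + add_cls_num = 4
```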

fadzaka12 commented 1 year ago

Thank you for the clarification