Closed: fadzaka12 closed this issue 1 year ago
Hi, to make the most of CLIP's initialization, we load the weights of CLIP's [CLS] token into the first video proxy token. Keeping the two parameters separate in the implementation makes this easy to do.
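A minimal sketch of how I read this (not the repository's actual code): keeping `class_embedding` separate from `added_cls` means the pretrained CLIP [CLS] embedding can be copied straight into the first proxy token while the remaining proxies start from fresh initialization. The class name, `hidden_size`, and the init scale below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VideoProxyTokens(nn.Module):
    """Hypothetical sketch of M = 1 + add_cls_num video proxy tokens."""

    def __init__(self, hidden_size=768, add_cls_num=3):
        super().__init__()
        # First proxy token: same name/shape as CLIP's [CLS] embedding,
        # so the pretrained weight can be loaded into it directly.
        self.class_embedding = nn.Parameter(torch.randn(hidden_size) * 0.02)
        # Remaining proxy tokens: newly initialized, learned during post-pretraining.
        self.added_cls = nn.Parameter(torch.randn(add_cls_num, hidden_size) * 0.02)

    def load_clip_cls(self, clip_cls_weight):
        # clip_cls_weight: pretrained [CLS] embedding from CLIP, shape (hidden_size,)
        with torch.no_grad():
            self.class_embedding.copy_(clip_cls_weight)

    def forward(self, batch_size):
        # Concatenate into M proxy tokens and broadcast over the batch.
        proxies = torch.cat([self.class_embedding[None, :], self.added_cls], dim=0)
        return proxies.unsqueeze(0).expand(batch_size, -1, -1)  # (B, M, hidden_size)
```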
Do we also need to set the number of video proxy tokens for the downstream tasks? From the paper, it seems to be needed only for the post-pretraining stage. However, in the configuration files you set the number of video proxy tokens to 4 for each downstream dataset.
No, the number of video proxy tokens is already fixed after the post-pretraining stage.
Thank you for the clarification
Congratulations on your paper's acceptance at ICLR 2023! Your work is insightful and achieves a significant performance improvement on the video-text retrieval task.
I want to ask about the implementation of the video proxy mechanism. In the paper, you mention that it is simply a learnable parameter of length M. However, when I look at the code, you define two separate parameters for the video proxy tokens: class_embedding and added_cls. class_embedding is a 1D vector of size hidden_size, while added_cls is a 2D matrix of shape add_cls_num x hidden_size. Correct me if I'm wrong, but I cannot find any reference to this split in the main paper.
I have checked the sample configuration for each dataset, and you set add_cls_num to 3. Does this correspond to the 4 video proxy tokens mentioned in the paper, i.e. add_cls_num x hidden_size plus 1 x hidden_size, with add_cls_num = 3? Can you explain the intuition behind separating this into class_embedding and added_cls?
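To make the arithmetic concrete, here is a small sketch of the shape accounting as I understand it from the question; the hidden_size value is an assumption, not taken from the repository's configs.

```python
import torch

hidden_size = 768   # assumed width; use the value from the actual model config
add_cls_num = 3     # value set in the sample downstream configs

class_embedding = torch.zeros(hidden_size)            # 1 proxy token
added_cls = torch.zeros(add_cls_num, hidden_size)     # add_cls_num proxy tokens

video_proxy = torch.cat([class_embedding.unsqueeze(0), added_cls], dim=0)
print(video_proxy.shape)  # torch.Size([4, 768]) -> M = 1 + add_cls_num = 4 proxy tokens
```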
Thank you