I have the zero-shot results of the 1-epoch post-pretrained CLIP-ViP (B/32): R@1: 31.5, R@5: 53.9, R@10: 63.4. The result is close to CLIP's. One reason is that the captions of MSR-VTT have very similar forms to image captions; they are all descriptive text. Another reason is that a wide range of video-language benchmarks do not heavily rely on the understanding of temporality [1]. As a result, the zero-shot performance of an image-language model is already good. However, our results show that post-pretraining improves the fine-tuning results by a large margin, benefiting from the good representations learned from video-language data.
[1] Buch, Shyamal, et al. "Revisiting the 'Video' in Video-Language Understanding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
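For reference, here is a minimal sketch of how these zero-shot recall metrics (R@1/R@5/R@10) can be computed from a text-video similarity matrix. The function name and the random placeholder features are illustrative, not taken from this repo; in practice you would use the model's text embeddings and video embeddings (e.g. mean-pooled frame features for plain CLIP).

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Text-to-video Recall@K from a [num_texts, num_videos] similarity matrix.

    Assumes text query i's ground-truth video is video i
    (the usual MSR-VTT 1k-A evaluation setup).
    """
    order = np.argsort(-sim, axis=1)                 # ranking of videos per query, best first
    gt = np.arange(sim.shape[0])[:, None]            # ground-truth index for each query
    ranks = np.argmax(order == gt, axis=1)           # position of the ground-truth video (0 = best)
    return {f"R@{k}": 100.0 * np.mean(ranks < k) for k in ks}

# Placeholder features; replace with real text / video embeddings.
text_feat = np.random.randn(1000, 512)
video_feat = np.random.randn(1000, 512)
text_feat /= np.linalg.norm(text_feat, axis=1, keepdims=True)
video_feat /= np.linalg.norm(video_feat, axis=1, keepdims=True)

print(recall_at_k(text_feat @ video_feat.T))
```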
Could you also provide the zero-shot result for CLIP-ViP (B/16)? I would be very grateful.
Thanks for your interesting work.
I am curious about the zero-shot performance of your CLIP-ViP on MSR-VTT.
I find that models pre-trained on video-text pairs (e.g. VideoCLIP, SimVLP) perform unsatisfactorily compared with their image-language counterparts (e.g. CLIP, BLIP) on zero-shot transfer to video retrieval tasks. How about the zero-shot performance of CLIP-ViP?
What do you think makes this phenomenon happen?