About the ActivityNet Captions dataset in CLIP-ViP

Hello, thank you for sharing your excellent work! I have reproduced the results on MSR-VTT and even obtained a higher result than the one in the paper.

But when I tried to reproduce the results on ActivityNet Captions, I found that in actnet_retrieval_vip_base_32.json the vision format is set to frame instead of video. I tried the video vision format with 32 sampled frames, and in the end it only got to almost R@1 = 20. I then used the OpenCV library to extract frames, but it still can't reach the result in the paper.
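
To make the frame extraction step concrete, here is a minimal sketch of uniformly sampling 32 frames from a video with OpenCV. It is only an illustration under stated assumptions: the function names `sample_frame_indices` and `extract_frames` are hypothetical, and the midpoint-of-segment sampling rule is an assumption, not necessarily the sampling that the CLIP-ViP data loader uses.

```python
def sample_frame_indices(total_frames, num_samples=32):
    """Pick num_samples frame indices spread evenly over the video.

    Assumption: the video is split into num_samples equal segments and
    the middle frame of each segment is taken. Indices repeat when the
    video has fewer frames than num_samples.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    return [int((i + 0.5) * total_frames / num_samples)
            for i in range(num_samples)]


def extract_frames(video_path, num_samples=32):
    """Read the sampled frames from video_path as RGB arrays (hypothetical helper)."""
    import cv2  # imported here so the index helper works without OpenCV

    cap = cv2.VideoCapture(video_path)
    try:
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for idx in sample_frame_indices(total, num_samples):
            # Seek to the requested frame, then decode it.
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if not ok:
                continue  # skip frames that fail to decode
            # OpenCV decodes to BGR; most vision models expect RGB.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        return frames
    finally:
        cap.release()
```

Differences in the sampling rule, the color channel order, or the resize and crop applied afterwards could each shift retrieval numbers, so they are worth comparing against the repository's own preprocessing.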

microsoft/XPretrain, issue #41