microsoft / XPretrain

Multi-modality pre-training

About activitynet captions dataset in CLIP-ViP #41

Open musicman217 opened 5 months ago

musicman217 commented 5 months ago

Hello, thank you for sharing your excellent work! I have reproduced the MSR-VTT results and even obtained a higher score than the one reported in the paper.

But when I tried to reproduce the results on ActivityNet Captions, I found that in actnet_retrieval_vip_base_32.json the vision format is set to frame instead of video. I first tried the video vision format with 32 sampled frames, and it almost reached R@1 = 20 in the end. I then used the OpenCV library to extract frames, but it still can't reach the result in the paper.
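
For reference, the kind of OpenCV frame extraction described above could look roughly like the sketch below. It is only an illustration: the function names `uniform_frame_indices` and `extract_frames` are hypothetical, and sampling the midpoint of 32 equal segments is an assumption, not necessarily the preprocessing used by CLIP-ViP or in the paper.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_frames: int = 32) -> List[int]:
    """Return num_frames indices spread evenly over [0, total_frames).

    Assumption: the midpoint of each of num_frames equal segments is taken.
    If the video has fewer frames than requested, indices repeat.
    """
    if total_frames <= 0 or num_frames <= 0:
        return []
    return [
        min(int((i + 0.5) * total_frames / num_frames), total_frames - 1)
        for i in range(num_frames)
    ]


def extract_frames(video_path: str, num_frames: int = 32) -> list:
    """Decode the uniformly sampled frames of a video as RGB arrays."""
    import cv2  # imported here so the index helper works without OpenCV

    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in uniform_frame_indices(total, num_frames):
        # Seek to the target frame; seeking can be inexact for some codecs.
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes to BGR; most vision models expect RGB.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```

Differences in which indices are sampled, in the BGR/RGB channel order, or in the frame count reported by the container are all possible sources of a gap between frame-based and video-based results, so they may be worth comparing against the repo's own loader.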