Hello, thank you for sharing your excellent work!
I have reproduced the results on MSR-VTT and even obtained a higher score than the one reported in the paper.
However, when I tried to reproduce the results on ActivityNet Captions, I found that in `actnet_retrieval_vip_base_32.json` the vision format is set to `frame` instead of `video`. I tried reproducing with the vision format set to `video` and 32 sampled frames, and the final result reaches only about R@1 = 20.
I then used the OpenCV library to extract frames, but it still cannot reach the result in the paper.
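To make the question concrete, here is a minimal sketch of what uniform 32-frame extraction with OpenCV could look like. It is only an illustration under my own assumptions: the helper names `uniform_frame_indices` and `extract_frames` are hypothetical, and taking the middle frame of each equal-length segment is one common choice, not necessarily the preprocessing used to produce the paper's numbers.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int = 32) -> List[int]:
    """Split the video into num_samples equal segments and take the
    middle frame index of each one (hypothetical sampling strategy)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    segment = total_frames / num_samples
    # Clamp to the last valid index in case of rounding at the end.
    return [min(total_frames - 1, int(segment * (i + 0.5)))
            for i in range(num_samples)]


def extract_frames(video_path: str, num_samples: int = 32):
    """Read the sampled frames from a video file as RGB arrays."""
    import cv2  # imported here so the index helper works without OpenCV

    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in uniform_frame_indices(total, num_samples):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the sampled frame
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes to BGR; most vision models expect RGB.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```

If the reference frames were produced differently (for example at a fixed FPS, at another resolution, or with a different sampling rule), that difference alone might explain part of the gap.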