salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BSD 3-Clause "New" or "Revised" License

Problem with the zero-shot text-to-video retrieval evaluation #23

Closed · cdqncn closed this issue 2 years ago

cdqncn commented 2 years ago

Dear author,

I tried to use the released checkpoints (model_base_retrieval_coco.pth and model_large_retrieval_coco.pth) to test on the MSRVTT retrieval dataset, but I got R@1 scores of 35.8 and 39.74, respectively. How can I reproduce the 43.3 reported in your arXiv paper? Thanks!

Yours sincerely
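For anyone reproducing this, a minimal sketch of one way to drive the zero-shot video evaluation with these checkpoints. The `blip_retrieval` factory matches `models/blip_retrieval.py` in this repo, but the checkpoint path, frame sampling, and mean-pooling below are assumptions, not the exact logic of the repo's video evaluation script:

```python
import torch
import torch.nn.functional as F
from models.blip_retrieval import blip_retrieval  # factory from this repo

# Load a released image-text retrieval checkpoint (local path is an assumption).
model = blip_retrieval(pretrained='model_base_retrieval_coco.pth',
                       image_size=384, vit='base')
model.eval()

@torch.no_grad()
def video_embedding(frames):
    """frames: [num_frames, 3, 384, 384] uniformly sampled, preprocessed frames.
    Encode each frame with the image encoder, project, and mean-pool so the
    image-trained model can score whole videos zero-shot."""
    cls = model.visual_encoder(frames)[:, 0, :]    # per-frame [CLS] feature
    emb = F.normalize(model.vision_proj(cls), dim=-1)
    return F.normalize(emb.mean(dim=0), dim=-1)    # average over frames
```

Note this sketch only covers the contrastive (ITC) similarity; BLIP's full retrieval evaluation additionally re-ranks the top candidates with the ITM head, so the final numbers come from that second stage.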

LiJunnan1992 commented 2 years ago

Hi, the result reported in the paper is for text-to-video retrieval, whereas my evaluation code outputs results for both video-to-text and text-to-video. Is it possible that your R@1 is for video-to-text?
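To make the distinction concrete, here is a minimal sketch (not the repo's evaluation code) of how the two directions yield different R@1 from the same score matrix; the one-caption-per-video ground truth assumed here matches the MSRVTT-1kA test split:

```python
import numpy as np

def recall_at_1(sim):
    """R@1 for a [num_queries, num_targets] score matrix where query i's
    ground-truth match is target i (one caption per video)."""
    best = np.argmax(sim, axis=1)                 # top-ranked target per query
    return 100.0 * np.mean(best == np.arange(sim.shape[0]))

sim = np.random.randn(1000, 1000)                 # stand-in for text-video scores
t2v_r1 = recall_at_1(sim)                         # text-to-video: rank videos per text
v2t_r1 = recall_at_1(sim.T)                       # video-to-text: rank texts per video
print(f"txt2vid R@1: {t2v_r1:.2f}  vid2txt R@1: {v2t_r1:.2f}")
```

Because the two directions rank along different axes of the matrix, their R@1 values generally differ, which is why the evaluation prints both.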

cdqncn commented 2 years ago

Thanks for your reply!

You are right, my R@1 was for video-to-text! I have learned a lot from your work on ALBEF and BLIP, and I am looking forward to your videoQA code. Thanks a lot again!

Yours sincerely