salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BSD 3-Clause "New" or "Revised" License

performance gap in Flickr retrieval #118

Open amandaluof opened 2 years ago

amandaluof commented 2 years ago

Thanks for your great work and well-written code. We are evaluating zero-shot retrieval performance using the checkpoint and evaluation code you provided. Our test results are below; there is a gap of about 10 points between them and the numbers in your paper. We suspect there may be a bug. Could you please share the evaluation results you get with this repo?

vit-large zero-shot retrieval {"val_txt_r1": 91.42011834319527, "val_txt_r5": 97.534516765286, "val_txt_r10": 98.91518737672584, "val_txt_r_mean": 95.95660749506904, "val_img_r1": 79.30966469428007, "val_img_r5": 94.04339250493096, "val_img_r10": 96.44970414201184, "val_img_r_mean": 89.93425378040763, "val_r_mean": 92.94543063773833, "test_txt_r1": 89.9, "test_txt_r5": 98.8, "test_txt_r10": 99.7, "test_txt_r_mean": 96.13333333333333, "test_img_r1": 80.38, "test_img_r5": 94.88, "test_img_r10": 97.12, "test_img_r_mean": 90.79333333333334, "test_r_mean": 93.46333333333334}

vit-large finetune retrieval {"val_txt_r1": 85.99605522682445, "val_txt_r5": 97.33727810650888, "val_txt_r10": 98.22485207100591, "val_txt_r_mean": 93.85272846811307, "val_img_r1": 77.85009861932939, "val_img_r5": 93.68836291913215, "val_img_r10": 96.60749506903353, "val_img_r_mean": 89.38198553583169, "val_r_mean": 91.61735700197238, "test_txt_r1": 85.4, "test_txt_r5": 97.9, "test_txt_r10": 99.0, "test_txt_r_mean": 94.10000000000001, "test_img_r1": 77.72, "test_img_r5": 94.2, "test_img_r10": 96.88, "test_img_r_mean": 89.60000000000001, "test_r_mean": 91.85000000000001}
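For anyone comparing these logs against the paper, the `*_r_mean` fields appear to be plain averages of the recall values: each direction's mean is the average of R@1, R@5 and R@10, and `r_mean` is the average of the two directions. The sketch below checks this against the test-split numbers above. The helper names `mean_recall` and `summarize` are hypothetical and not part of the BLIP repo.

```python
def mean_recall(r1, r5, r10):
    """Average of R@1, R@5 and R@10 for one retrieval direction."""
    return (r1 + r5 + r10) / 3


def summarize(txt, img):
    """Combine text-retrieval and image-retrieval recalls into summary means.

    txt and img are (R@1, R@5, R@10) tuples, in percent.
    """
    txt_mean = mean_recall(*txt)
    img_mean = mean_recall(*img)
    return {
        "txt_r_mean": txt_mean,
        "img_r_mean": img_mean,
        "r_mean": (txt_mean + img_mean) / 2,
    }


# Test-split recalls copied from the two logs above.
zero_shot_test = summarize(txt=(89.9, 98.8, 99.7), img=(80.38, 94.88, 97.12))
finetune_test = summarize(txt=(85.4, 97.9, 99.0), img=(77.72, 94.2, 96.88))

for name, result in [("zero-shot", zero_shot_test), ("finetune", finetune_test)]:
    print(name, {key: round(value, 2) for key, value in result.items()})
```

Recomputing the means this way is a quick sanity check that a log was copied correctly before comparing individual R@K values with a published table.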

LiJunnan1992 commented 1 year ago

Hi, the Flickr zero-shot evaluation uses the model fine-tuned on COCO.

amandaluof commented 1 year ago

> Hi, the Flickr zero-shot evaluation uses the model fine-tuned on COCO.

Sorry, we missed that detail in the paper. We do get the same results as reported in the paper, for both the zero-shot and fine-tuned settings. Thanks!

lijain commented 1 year ago

Sorry to miss the details you mentioned in the paper. Could you please tell me what those details are? My test results are also relatively low. [image]