amandaluof opened this issue 2 years ago
Hi, the Flickr zero-shot evaluation uses the model fine-tuned on COCO.
Sorry, we missed that detail in the paper. We now get the same results as reported in the paper, for both zero-shot and fine-tuned retrieval. Thanks!
Regarding "Sorry to miss the details you mentioned in the paper": could you tell me which details you mean? My test results are also relatively low.
Thanks for your great work and well-written code. We are evaluating zero-shot retrieval performance using the checkpoint and evaluation code you provided. Our results are below, and there is a gap of about 10 points compared with the numbers in your paper. We suspect there may be a bug. Could you please share your evaluation results obtained with this repo?
ViT-Large zero-shot retrieval: {"val_txt_r1": 91.42011834319527, "val_txt_r5": 97.534516765286, "val_txt_r10": 98.91518737672584, "val_txt_r_mean": 95.95660749506904, "val_img_r1": 79.30966469428007, "val_img_r5": 94.04339250493096, "val_img_r10": 96.44970414201184, "val_img_r_mean": 89.93425378040763, "val_r_mean": 92.94543063773833, "test_txt_r1": 89.9, "test_txt_r5": 98.8, "test_txt_r10": 99.7, "test_txt_r_mean": 96.13333333333333, "test_img_r1": 80.38, "test_img_r5": 94.88, "test_img_r10": 97.12, "test_img_r_mean": 90.79333333333334, "test_r_mean": 93.46333333333334}
ViT-Large fine-tuned retrieval: {"val_txt_r1": 85.99605522682445, "val_txt_r5": 97.33727810650888, "val_txt_r10": 98.22485207100591, "val_txt_r_mean": 93.85272846811307, "val_img_r1": 77.85009861932939, "val_img_r5": 93.68836291913215, "val_img_r10": 96.60749506903353, "val_img_r_mean": 89.38198553583169, "val_r_mean": 91.61735700197238, "test_txt_r1": 85.4, "test_txt_r5": 97.9, "test_txt_r10": 99.0, "test_txt_r_mean": 94.10000000000001, "test_img_r1": 77.72, "test_img_r5": 94.2, "test_img_r10": 96.88, "test_img_r_mean": 89.60000000000001, "test_r_mean": 91.85000000000001}
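For anyone comparing numbers, here is a minimal sketch of how recall@K metrics with these key names can be computed from an image-text similarity matrix. It is not this repo's evaluation code; the helper name `recall_at_k`, the `sim_i2t` / `txt2img` inputs, and the reading of `txt_r*` as image-to-text and `img_r*` as text-to-image are assumptions. It only covers the final ranking step; how the scores themselves are produced (and any re-ranking) is not shown. The aggregate fields in the logs above are consistent with simple means, e.g. (89.9 + 98.8 + 99.7) / 3 ≈ 96.13 = test_txt_r_mean and (96.13 + 90.79) / 2 ≈ 93.46 = test_r_mean.

```python
import numpy as np


def recall_at_k(sim_i2t, txt2img, ks=(1, 5, 10)):
    """Hypothetical helper: image-text retrieval recall@K, in percent.

    Assumed inputs (not taken from the repo):
      sim_i2t[i, t] is the similarity between image i and caption t;
      txt2img[t] is the index of the image that caption t describes.
    Image -> text ("txt_r*"): an image query is a hit at K if ANY of its
    ground-truth captions is among the top-K captions.
    Text -> image ("img_r*"): a caption query is a hit at K if its image
    is among the top-K images.
    """
    n_img, n_txt = sim_i2t.shape
    txt2img = np.asarray(txt2img)

    # Image -> text: best (lowest) rank among each image's ground-truth captions.
    i2t_ranks = np.empty(n_img, dtype=int)
    for i in range(n_img):
        order = np.argsort(-sim_i2t[i])           # caption indices, highest score first
        gt = np.flatnonzero(txt2img[order] == i)  # ranks at which a GT caption appears
        i2t_ranks[i] = gt.min()

    # Text -> image: rank of the single ground-truth image.
    t2i_ranks = np.empty(n_txt, dtype=int)
    for t in range(n_txt):
        order = np.argsort(-sim_i2t[:, t])        # image indices, highest score first
        t2i_ranks[t] = np.flatnonzero(order == txt2img[t])[0]

    out = {}
    for k in ks:
        out[f"txt_r{k}"] = 100.0 * np.mean(i2t_ranks < k)
        out[f"img_r{k}"] = 100.0 * np.mean(t2i_ranks < k)
    # Aggregates: mean over K per direction, then mean of the two directions.
    out["txt_r_mean"] = float(np.mean([out[f"txt_r{k}"] for k in ks]))
    out["img_r_mean"] = float(np.mean([out[f"img_r{k}"] for k in ks]))
    out["r_mean"] = (out["txt_r_mean"] + out["img_r_mean"]) / 2
    return out


# Toy example: 2 images, 4 captions (captions 0-1 describe image 0, 2-3 image 1).
sim = np.array([[0.9, 0.1, 0.8, 0.2],
                [0.3, 0.7, 0.4, 0.6]])
print(recall_at_k(sim, [0, 0, 1, 1]))
```

In the toy example, image 1's best ground-truth caption is ranked second, so `txt_r1` is 50 while `txt_r5` and `txt_r10` are 100. Note that with one decimal place, the test `txt_r*` values above are consistent with 1,000 image queries, and the two-decimal `img_r*` values with 5,000 caption queries.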