salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

Image-Text Retrieval Task, ITC score for ranking #74

Closed yxoh closed 2 years ago

yxoh commented 2 years ago

I saw that the original setting uses the ITM score s_itm for ranking, but it is more expensive to compute. Is it OK to use only the feature similarity score s_itc for ranking during inference?

LiJunnan1992 commented 2 years ago

Yes, using ITC is faster but less accurate.
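For reference, the speed/accuracy trade-off comes from ALBEF's two-stage ranking: a cheap ITC similarity pass, optionally followed by ITM reranking of the top candidates. A minimal sketch of this scheme (illustrative only, not the repo's code; `itm_score` is a hypothetical stand-in for ALBEF's cross-attention ITM head):

```python
import numpy as np

def l2_normalize(x):
    # Unit-normalize feature rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def itc_rank(image_feats, text_feats, k=16):
    # Cheap stage: score every image-text pair with one matrix multiply
    # over the unimodal features, keep the top-k texts per image.
    sims = l2_normalize(image_feats) @ l2_normalize(text_feats).T
    return np.argsort(-sims, axis=1)[:, :k]

def itm_rerank(candidates, itm_score):
    # Expensive stage: re-score only the k ITC candidates per image with
    # the ITM head (one cross-attention forward pass per pair), re-sort.
    reranked = []
    for img_idx, cand in enumerate(candidates):
        scores = np.array([itm_score(img_idx, t) for t in cand])
        reranked.append(cand[np.argsort(-scores)])
    return np.stack(reranked)
```

Ranking with ITC alone is just `itc_rank(...)[:, 0]`, a single matmul for the whole test set; the rerank adds roughly `num_images × k` transformer forward passes, which is where the extra accuracy and the extra cost both come from.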

yxoh commented 2 years ago

> but less accurate

Are there experiments that show how much the accuracy rate has dropped?

yxoh commented 2 years ago

Ah, I found the experiment in the paper :)

yxoh commented 2 years ago

I downloaded the data JSON files and the pretrained ALBEF (4M) model from this repo and ran the image-text retrieval task. The zero-shot results on the Flickr30K dataset are TR (R@1: 84.9, R@5: 97.2, R@10: 99.0); IR (R@1: 68.18, R@5: 88.58, R@10: 93.02). In the paper, however, the results are TR (R@1: 90.5, R@5: 98.8, R@10: 99.7); IR (R@1: 76.8, R@5: 93.7, R@10: 96.7). How can I reproduce the results reported in the paper?

LiJunnan1992 commented 2 years ago

The Flickr zero-shot results are obtained using the COCO-finetuned model.
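For anyone else reproducing this: the evaluation can be launched along these lines. This is a sketch from memory of the repo's README, so the script name, config path, and flags should be checked against the repo, and the checkpoint path is a placeholder for the COCO-finetuned checkpoint.

```shell
# Evaluate a downloaded COCO-finetuned ALBEF checkpoint on Flickr30K
# (zero-shot for Flickr, since no Flickr finetuning is done).
python -m torch.distributed.launch --nproc_per_node=8 --use_env Retrieval.py \
    --config ./configs/Retrieval_flickr.yaml \
    --output_dir output/Retrieval_flickr_zeroshot \
    --checkpoint ./checkpoints/mscoco.pth \
    --evaluate
```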

yxoh commented 2 years ago

> The flickr zero-shot results are obtained using the coco-finetuned model

That helps. Thanks :)