salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

Zero-shot capabilities on ImageNet #119

Open kimihailv opened 1 year ago

kimihailv commented 1 year ago

Hello, I evaluated ALBEF 14M on the ImageNetV2 classification task and it showed relatively low accuracy: top-1 32.9, top-5 60.7. What do you think is the reason for these results? Is it the much smaller training dataset compared to CLIP?
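
For context, top-1/top-5 numbers in a zero-shot evaluation like the one described above are usually obtained by ranking class-prompt embeddings by cosine similarity to each image embedding. The following is only a minimal sketch of that metric, assuming the image and class-prompt embeddings have already been extracted by the model. The helper names `l2_normalize` and `topk_accuracy` are hypothetical and are not part of the ALBEF repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def topk_accuracy(image_embs, class_embs, labels, ks=(1, 5)):
    """Return {k: accuracy in percent} for zero-shot classification.

    image_embs: one embedding per image (assumed precomputed).
    class_embs: one embedding per class prompt (assumed precomputed).
    labels:     ground-truth class index for each image.
    """
    class_embs = [l2_normalize(c) for c in class_embs]
    hits = {k: 0 for k in ks}
    for emb, label in zip(image_embs, labels):
        emb = l2_normalize(emb)
        # Cosine similarity between this image and every class prompt.
        sims = [sum(a * b for a, b in zip(emb, c)) for c in class_embs]
        # Class indices sorted from most to least similar.
        ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
        for k in ks:
            if label in ranked[:k]:
                hits[k] += 1
    n = len(labels)
    return {k: 100.0 * hits[k] / n for k in ks}
```

Calling `topk_accuracy(image_embs, class_embs, labels)` returns a dictionary keyed by `k`. Prompt templates and the choice of embedding head can also shift these numbers noticeably, so they are worth checking alongside dataset size.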

LiJunnan1992 commented 1 year ago

Hi @kimihailv, we haven't run this evaluation ourselves, but yes, zero-shot performance is largely correlated with the size of the training dataset.

shyammarjit commented 3 months ago

Zero-shot performance on other datasets (such as DTD, Food101, Caltech101, SUN397, etc.) is also much lower than that of CLIP, MetaCLIP, and open_clip.