om-ai-lab / VL-CheckList

Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations. [EMNLP 2022]

Reproducing CLIP score in the paper #12

Open kkjh0723 opened 1 year ago

kkjh0723 commented 1 year ago

Hi,

Thanks for open-sourcing the code. I'm trying to reproduce the CLIP scores reported in the paper but haven't been able to. I used the sample config file, changing MODE_NAME to CLIP (ViT-L/14), evaluated all the datasets in the corpus, and then averaged the final accuracies. I got the following scores, which are quite different from the paper:

Object: 0.8205209550766983
Attribute: 0.6806109948697314
Relation: 0.67975
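
The numbers above are just the unweighted mean of the per-dataset accuracies within each category. A minimal sketch of that averaging step (the file layout and "accuracy" field here are only placeholders, not the repo's actual output format):

```python
import json
from pathlib import Path
from statistics import mean

# Hypothetical layout: one JSON file per evaluated dataset, each containing
# a single "accuracy" field. The real VL-CheckList output files may differ.
def average_category(result_dir: str) -> float:
    accs = []
    for path in Path(result_dir).glob("*.json"):
        with open(path) as f:
            accs.append(json.load(f)["accuracy"])
    return mean(accs)

for category in ["Object", "Attribute", "Relation"]:
    print(category, average_category(f"results/{category}"))
```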

How can I reproduce the scores in the paper?

ayushchakravarthy commented 1 year ago

Hi, @kkjh0723

Did you have to make any changes to the code in order to get it working? I am also trying to replicate the CLIP result but am unable to do so.

Thanks!

kkjh0723 commented 1 year ago

@ayushchakravarthy, if I remember correctly, a few minor changes were required to run CLIP.

In the following lines, I changed result_tmp[i][0][1] to result_tmp[i][0][0] and result_tmp[i][1][1] to result_tmp[i][1][0].
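
As far as I can tell, the indexing differs because ITM-style heads return two logits per image-text pair while CLIP returns a single similarity score. A toy illustration of what the index swap amounts to (the actual structure of result_tmp in the repo may differ):

```python
# Toy illustration only; real VL-CheckList structures may differ.
# ITM-style heads output two values per pair: [no_match, match],
# so the "match" score lives at index 1.
result_tmp_itm = [([0.1, 0.9], [0.7, 0.3])]   # (pos_scores, neg_scores)
pos_score = result_tmp_itm[0][0][1]           # -> 0.9
neg_score = result_tmp_itm[0][1][1]           # -> 0.3

# CLIP gives a single similarity per pair, so the only score is at index 0.
result_tmp_clip = [([0.87], [0.42])]
pos_score = result_tmp_clip[0][0][0]          # -> 0.87
neg_score = result_tmp_clip[0][1][0]          # -> 0.42
```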

Also, in these lines, I changed them as follows:

sample_t = random.sample(sample_true,self.sample_num if len(sample_true)>self.sample_num else len(sample_true))
sample_f = random.sample(sample_false,self.sample_num if len(sample_false)>self.sample_num else len(sample_false))
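
(For context: random.sample raises ValueError if the requested count exceeds the length of the list, which is why the count needs to be clamped. An equivalent, slightly more direct way to write the same thing:)

```python
# Clamp the sample size to the list length so random.sample never raises
# ValueError when the candidate list is shorter than self.sample_num.
sample_t = random.sample(sample_true, min(self.sample_num, len(sample_true)))
sample_f = random.sample(sample_false, min(self.sample_num, len(sample_false)))
```
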
feilvvl commented 11 months ago

Hi @kkjh0723, have you reproduced the results of this work? I have tried many times, but the end result is not satisfactory. I used CLIP (ViT-B/32) as my model and selected the "ITM" task for testing. My final average scores were:

Attribute: 68.6477405706409
Relation: 74.7221415628598
Object: 89.4515112110188

These results are much higher than the paper's. So I'd like to know how much data you used, since your results don't differ that much. Thank you!