Correlations on Composite

jmhessel / clipscore

CLIPScore EMNLP code

MIT License

194 stars, 25 forks

Correlations on Composite #13

Closed akskuchi closed 1 year ago

akskuchi commented 1 year ago

Hello,

Thanks for the nice contribution. I am trying to understand how you calculated the correlations with the Composite caption-level Likert judgements.

You mentioned in the paper that Composite contains 12K judgements, covering F8K (997 images), F30K (991 images), and MSCOCO (2007 images). In the judgement .csv files at the AMT eval link you provided, there are 3 judgements for each F8K image and 4 for each F30K and MSCOCO image. This adds up to ~15K judgements (997 × 3 + 991 × 4 + 2007 × 4 = 14,983).
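For reference, a minimal sketch of the arithmetic behind the ~15K figure. The image counts are the ones quoted from the paper and the per-image judgement counts are the ones observed in the .csv files; the dictionary names are illustrative and not taken from the repository.

```python
# Image counts quoted from the paper, per Composite subset.
images = {"flickr8k": 997, "flickr30k": 991, "mscoco": 2007}

# Judgements per image as observed in the downloaded .csv files.
judgements_per_image = {"flickr8k": 3, "flickr30k": 4, "mscoco": 4}

# Raw number of judgements in each subset, and overall.
per_dataset = {name: images[name] * judgements_per_image[name] for name in images}
total = sum(per_dataset.values())

print(per_dataset)
print(total)
```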

Is there a reason why you considered only 12K or am I missing something?

jmhessel commented 1 year ago

Hi there! Thanks for your interest in our work. Great question! I don't know the answer offhand. It's been a while since I've run these correlations, so I am a bit out of date on the specifics.

From what I remember, there are 3995 images × 3 judgments per image = 11,985 judgments, which is what we used. This is the standard number of judgments used for Composite; see, for example, https://aclanthology.org/2021.findings-emnlp.395.pdf and https://arxiv.org/pdf/2106.14019.pdf . We have some details about Composite in the paper/appendix; perhaps the answer is in there somewhere. What do you think?

Happy to chat more about it; curious to see if you find anything :-)

akskuchi commented 1 year ago

Hello, thanks for your response :) I will look into the work you've linked.

jmhessel commented 1 year ago

Closing for now; feel free to re-open if I can be helpful.

bigbrother001 commented 1 year ago

Hello, I'm confused too. When I download the Composite dataset from https://imagesdg.wordpress.com/image-to-scene-description-graph/, there are 3 human judgment scores per image for Flickr8k and 4 per image for Flickr30k and MSCOCO, just as @akskuchi said.

jmhessel commented 1 year ago

Could it be that one of the 4 human judgments is made on the reference? As described in the paper, I remember removing human judgments made over the reference captions (which were used to compute the reference-backed metrics).
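A rough sketch of how that hypothesis could be checked. This is not the code used for the paper: the `Judgment` record and its `is_reference` flag are hypothetical, since the actual .csv files may mark reference captions differently or not at all. It also assumes that exactly one of the four judgments per Flickr30k/MSCOCO image is on a reference caption and that none of the three Flickr8k judgments is.

```python
from typing import NamedTuple


class Judgment(NamedTuple):
    image_id: str
    caption: str
    rating: int
    is_reference: bool  # hypothetical flag: True if the rated caption is a reference


def drop_reference_judgments(rows):
    """Keep only judgments made on candidate (non-reference) captions."""
    return [row for row in rows if not row.is_reference]


# Toy rows for one image with 4 judgments, one of them on a reference caption.
rows = [
    Judgment("img_0", "a reference caption", 5, True),
    Judgment("img_0", "candidate caption A", 4, False),
    Judgment("img_0", "candidate caption B", 2, False),
    Judgment("img_0", "candidate caption C", 1, False),
]
kept = drop_reference_judgments(rows)

# Under the stated assumption, every image ends up with 3 judgments.
images = {"flickr8k": 997, "flickr30k": 991, "mscoco": 2007}
kept_per_image = {"flickr8k": 3, "flickr30k": 4 - 1, "mscoco": 4 - 1}
total_kept = sum(images[name] * kept_per_image[name] for name in images)

print(len(kept), total_kept)
```

If the assumption holds, the total comes out to 3995 × 3 = 11,985, which matches the number quoted earlier in the thread.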