shikiw / OPERA

[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

reproducing the result #19

Closed JEONG8652 closed 4 months ago

JEONG8652 commented 4 months ago

Hello, thank you for the great work!

I'm trying to reproduce the POPE / CHAIR results.

When I evaluate the POPE popular set on LLaVA-1.5, I get the following result:

Accuracy: 0.868
Precision: 0.8780821917808219
Recall: 0.8546666666666667
F1 score: 0.8662162162162163
Yes ratio: 0.4866666666666667

which is a bit higher than the numbers reported in README.md. The result was the same whether I set the random seed to 42 (the default) or left it unset. Is there anything else that could affect the result?
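
For reference, the numbers above are the standard yes/no classification metrics, so they should be directly comparable once the answers are fixed. A minimal sketch of how they can be computed (variable names are hypothetical, not OPERA's exact evaluation script):

```python
# Sketch of POPE-style metrics from binary yes/no answers.
# preds/labels: lists of booleans, True = "yes".
def pope_metrics(preds, labels):
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    tn = sum(not p and not l for p, l in zip(preds, labels))

    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    yes_ratio = (tp + fp) / len(labels)  # fraction of "yes" answers
    return accuracy, precision, recall, f1, yes_ratio
```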

I have also run the LLaVA-1.5 CHAIR evaluation:

CHAIRs    : 49.9
CHAIRi    : 14.1
Recall    : 78.3
Len       : 94.7

This is worse than every baseline in Table 1. Could you please share the list of 500 randomly sampled COCO images so I can reproduce the paper's result?
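
In case it helps locate the discrepancy, here is a hedged sketch of the standard CHAIR definitions I am comparing against (hypothetical inputs, not the repo's exact evaluation code): CHAIRi is the fraction of mentioned objects that are not in the ground-truth set, CHAIRs is the fraction of captions containing at least one such object.

```python
# Sketch of sentence-level (CHAIRs) and instance-level (CHAIRi) scores.
# samples: list of (mentioned_objects, gt_objects) pairs, one per caption.
def chair_metrics(samples):
    halluc_mentions = total_mentions = halluc_captions = 0
    covered_gt = total_gt = 0
    for mentioned, gt_objects in samples:
        halluc = [obj for obj in mentioned if obj not in gt_objects]
        halluc_mentions += len(halluc)
        total_mentions += len(mentioned)
        halluc_captions += 1 if halluc else 0
        covered_gt += len(set(mentioned) & set(gt_objects))
        total_gt += len(set(gt_objects))

    chair_s = 100.0 * halluc_captions / len(samples)
    chair_i = 100.0 * halluc_mentions / max(total_mentions, 1)
    recall = 100.0 * covered_gt / max(total_gt, 1)
    return chair_s, chair_i, recall
```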

shikiw commented 4 months ago

Hi, thanks for your interest!

We are not sure of the actual reason for the differences in the reproduced POPE results; they might be caused by different devices or environments. For the 500-image random COCO list, could you email me at hqd0037@mail.ustc.edu.cn? I will share the list with you.

Note: To reproduce the results, it is better to use our default setting rather than the efficient version.
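
On the device/environment point, a minimal sketch of pinning the usual sources of randomness before running the evaluation (this does not remove all nondeterminism across GPUs or driver versions, which may still explain a small POPE gap):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Fix Python, NumPy, and PyTorch RNGs, and force deterministic cuDNN kernels.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```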

JEONG8652 commented 4 months ago

Thank you for the reply.

I will keep trying to reproduce the POPE results. I have also sent you an email requesting the list.

Thank you for sharing the list.