tianyi-lab / HallusionBench

[CVPR'24] HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
BSD 3-Clause "New" or "Revised" License

Why use GPT-4 to evaluate the output? #6

Closed eyuansu62 closed 11 months ago

eyuansu62 commented 11 months ago

According to the paper, the output of the LVLM is one of {yes, no, unknown}.

FuxiaoLiu commented 11 months ago

Thanks for your interest in our paper! Our benchmark is a diagnostic suite for analyzing hallucination in LVLMs. If the answer is only yes or no without further explanation, it is hard to classify whether a failure is language hallucination or visual illusion. Also, even when GPT-4V generates "yes" at the beginning of a sentence, the semantic meaning of the rest of the sentence is sometimes negative. Keyword matching alone therefore does not work. GPT-4 can handle this case.
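To illustrate the keyword-matching problem, here is a minimal sketch in Python. The `keyword_match` helper and the sample response are made up for this example and are not taken from the HallusionBench code; the point is only that a matcher keyed on the first yes/no token mislabels a response whose remaining text reverses the opening word.

```python
import re


def keyword_match(response: str) -> str:
    """Naive judge (hypothetical helper): label a response by its first yes/no keyword."""
    match = re.search(r"\b(yes|no)\b", response.lower())
    return match.group(1) if match else "unknown"


# Invented response for illustration: it opens with "Yes",
# but the rest of the sentence argues for the opposite answer.
response = (
    "Yes, the two lines look different at first glance, "
    "but they are actually the same length, so the answer is no."
)

# The naive matcher stops at the leading "Yes" and ignores the reversal.
print(keyword_match(response))  # prints "yes"
```

A judge that reads the whole response, such as GPT-4 prompted with the question and the ground-truth answer, can pick up the reversal that the first-keyword rule misses.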