Why use GPT-4 to evaluate the output?

Thanks for your interest in our paper! Our benchmark is a diagnostic suite for analyzing hallucination in LVLMs. If an answer is only "yes" or "no" with no further explanation, it is hard to tell whether a failure comes from language hallucination or visual illusion. In addition, even when GPT-4V begins its response with "yes", the rest of the response is sometimes negative in meaning. Keyword matching alone therefore does not work. GPT-4 can judge the meaning of the whole response, which addresses this problem.
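To make the keyword-matching failure concrete, here is a minimal sketch. It is not code from the HallusionBench repository: the function `keyword_verdict` and the sample response are hypothetical, written only to illustrate a response that opens with "yes" but is negative overall.

```python
import re


def keyword_verdict(response: str) -> str:
    """Naive baseline (hypothetical): label a response by its first yes/no token."""
    match = re.search(r"\b(yes|no)\b", response.lower())
    return match.group(1) if match else "unknown"


# Hypothetical model output: it opens with "Yes" but rejects the claim overall.
response = (
    "Yes, the chart looks similar at first glance, but the values shown "
    "do not support the statement, so the claim is incorrect."
)

# The keyword matcher only sees the leading "Yes" and ignores the rest.
print(keyword_verdict(response))  # prints "yes"
```

A human reader, or an LLM judge that reads the full response, would score this answer as "no". The keyword baseline cannot, because it never looks past the first matching token.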

tianyi-lab / HallusionBench
