Closed mbchang closed 9 months ago
Thank you for bringing this to our attention.
Our benchmark methodology defines uniqueness based on the exact match of the image and question text. While two examples may appear conceptually similar, they are included if they pass this criterion. This approach aims to test the robustness of models in consistently providing correct answers, even in the presence of subtle variations. Therefore, the inclusion of questions 107 and 663 serves as a way to challenge model consistency.
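To make the uniqueness criterion concrete, here is a minimal sketch of what an exact-match check on image and question text could look like. It is not the benchmark's actual deduplication code: the function names, the use of a SHA-256 digest over raw image bytes, and the sample data are all assumptions made for illustration.

```python
import hashlib


def example_key(image_bytes: bytes, question: str) -> tuple[str, str]:
    """Build a key that is equal only when both image bytes and question text match exactly."""
    # Assumption: images are compared byte-for-byte via a SHA-256 digest.
    return hashlib.sha256(image_bytes).hexdigest(), question


def find_duplicates(examples):
    """Return (first_id, duplicate_id) pairs for examples that share the same key."""
    seen = {}
    duplicates = []
    for example_id, image_bytes, question in examples:
        key = example_key(image_bytes, question)
        if key in seen:
            duplicates.append((seen[key], example_id))
        else:
            seen[key] = example_id
    return duplicates


# Hypothetical data: "a" and "c" are identical, "b" differs only in question wording.
examples = [
    ("a", b"clock-image", "It is (_) past six."),
    ("b", b"clock-image", "It is (_) past six"),
    ("c", b"clock-image", "It is (_) past six."),
]
print(find_duplicates(examples))
```

Under this criterion, two examples that differ by a single character in the question text, or by any byte in the image, count as distinct even when they test the same concept.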
The benchmark is designed to assess not only mathematical reasoning but also the model's ability to comprehend and respond appropriately to linguistic nuances. Although both 'B. quarter' and 'E. quarter past' could conceptually fit the blank, the grammatical context dictates the correct answer. This distinction is crucial in evaluating a model's comprehensive understanding beyond mere string-matching.
The instance where Bard answers 'quarter past' and is marked incorrect underscores the benchmark's emphasis on precise language understanding. This aspect is as vital as the mathematical reasoning component of the assessment.
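The scoring behavior under discussion can be illustrated with a short sketch of string-matching evaluation. This is an assumed simplification, not the paper's evaluation code: `exact_match_score` and its normalization (trimming whitespace and lowercasing) are hypothetical, and the choices are copied from the question in this thread.

```python
def exact_match_score(prediction: str, choices: dict[str, str], answer_key: str) -> bool:
    """Return True only if the normalized prediction equals the gold choice text."""
    # Assumption: normalization is limited to trimming whitespace and lowercasing.
    def normalize(text: str) -> str:
        return text.strip().lower()

    return normalize(prediction) == normalize(choices[answer_key])


choices = {
    "A": "half",
    "B": "quarter",
    "C": "o'clock",
    "D": "quarter to",
    "E": "quarter past",
}

# The gold answer is B ("quarter"); a response of "quarter past" fails the match.
print(exact_match_score("quarter past", choices, "B"))
print(exact_match_score("quarter", choices, "B"))
```

With this kind of scoring, a response that matches choice E is counted as wrong regardless of whether the model read the clock correctly, which is the behavior the two sides of this thread interpret differently.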
The question asks to fill in the blank:
It is (_) past six.
and the answer choices are:
A. half
B. quarter
C. o'clock
D. quarter to
E. quarter past
The correct answer is B. E is conceptually correct, although it is grammatically wrong in the blank (it would read "It is quarter past past six"), so marking E as incorrect can be misleading about the model's ability to answer the question. For example, Bard answers with
quarter past
but is marked as incorrect given the string-matching score calculation employed by the paper.
Here is Question 107:
Here is Question 663: