lupantech / MathVista

MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts
https://mathvista.github.io/
Creative Commons Attribution Share Alike 4.0 International

Problems 107 and 663 (among others): Duplicated question that does not actually test math ability #5

Closed (mbchang closed 9 months ago)

mbchang commented 10 months ago
  1. Questions 107 and 663 are essentially the same.
  2. The answer choices are worded in a confusing way that tests English grammar rather than math ability. The same issue affects questions 7, 226, 337, 419, 477, and 531.

The question asks the model to fill in the blank in "It is (_) past six." The answer choices are: A. half B. quarter C. o'clock D. quarter to E. quarter past

The correct answer is B. E is grammatically wrong in the blank but still conceptually correct, so marking E as incorrect can give a misleading picture of the model's ability to answer the question. For example, Bard answers "quarter past" but is marked incorrect under the string-matching score calculation used in the paper.
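
To make the scoring concern concrete, here is a minimal sketch of strict string matching. It is not the benchmark's actual evaluation code: the function name `exact_match` and the normalization (trimming whitespace, lowercasing) are assumptions made for illustration only.

```python
def exact_match(prediction: str, answer: str) -> bool:
    """Hypothetical strict scorer: correct only if the normalized strings are identical."""
    return prediction.strip().lower() == answer.strip().lower()


# Answer choices as quoted in the question above.
choices = ["half", "quarter", "o'clock", "quarter to", "quarter past"]
answer = choices[1]  # option B, "quarter"

print(exact_match("quarter", answer))       # True
print(exact_match("quarter past", answer))  # False
```

Under a scorer like this, a model that reads the clock correctly but picks "quarter past" is scored the same as a model that misreads the clock entirely.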

Here is Question 107:

image

Here is Question 663:

image

lupantech commented 9 months ago

Thank you for bringing this to our attention.

Our benchmark methodology defines uniqueness based on the exact match of the image and question text. While two examples may appear conceptually similar, they are included if they pass this criterion. This approach aims to test the robustness of models in consistently providing correct answers, even in the presence of subtle variations. Therefore, the inclusion of questions 107 and 663 serves as a way to challenge model consistency.
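
As a rough illustration of this uniqueness criterion, the sketch below flags two examples as duplicates only when both the image and the question text match exactly. It is not the benchmark's code: the helper names, the use of a SHA-256 hash over the raw image bytes, and the placeholder data are all assumptions.

```python
import hashlib


def example_key(image_bytes: bytes, question: str) -> tuple[str, str]:
    """Hypothetical uniqueness key: hash of the raw image bytes plus the exact question text."""
    return hashlib.sha256(image_bytes).hexdigest(), question


def find_exact_duplicates(examples):
    """Return (first_id, later_id) pairs whose image and question both match exactly."""
    seen = {}
    duplicates = []
    for example_id, image_bytes, question in examples:
        key = example_key(image_bytes, question)
        if key in seen:
            duplicates.append((seen[key], example_id))
        else:
            seen[key] = example_id
    return duplicates


# Placeholder data, invented for illustration (not taken from the dataset).
examples = [
    ("a", b"clock-image-1", "It is (_) past six."),
    ("b", b"clock-image-2", "It is (_) past six."),  # same text, different image
    ("c", b"clock-image-1", "It is (_) past six."),  # same text, same image
]

print(find_exact_duplicates(examples))  # [('a', 'c')]
```

With a key like this, examples "a" and "b" both stay in the set even though they ask the same question, which is the situation the issue describes as "essentially the same".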

The benchmark is designed to assess not only mathematical reasoning but also the model's ability to comprehend and respond appropriately to linguistic nuances. Although both 'B. quarter' and 'E. quarter past' could conceptually fit the blank, the grammatical context dictates the correct answer. This distinction is crucial in evaluating a model's comprehensive understanding beyond mere string-matching.

The instance where Bard answers 'quarter past' and is marked incorrect underscores the benchmark's emphasis on precise language understanding. This aspect is as vital as the mathematical reasoning component of the assessment.