mathllm / MATH-V

MATH-Vision dataset and code to measure Multimodal Mathematical Reasoning capabilities.
https://mathllm.github.io/mathvision/
MIT License

Pattern match failed to capture `(A) A` #3

Closed huiyeruzhou closed 4 months ago

huiyeruzhou commented 5 months ago

Hi! I've run into a problem while checking the evaluation policy. Generating the answer without a trailing newline, such as `(A) something`, is a common pattern; however, it will be interpreted as `something` by the evaluation code.

Example (the extracted answer is expected to be `c` instead of `c60`): [image]

Experiment: in lines 26~27 we have:

                if model_answer.endswith(f" {c}.") or model_answer.endswith(f" ({c}).") or model_answer.startswith(f"{c}\n") or model_answer.startswith(f"({c})\n") or model_answer.startswith(f"({c}) {c}\n"):
                    model_answer = c

By changing it to the following (removing the `\n` from the pattern):

                if response.endswith(f" {c}.") or response.endswith(f" ({c}).") or response.startswith(f"{c}") or response.startswith(f"({c})"):
                    model_answer = c

the pass rate for llava-7b doubled (7.24% -> 16.12%).
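The difference between the two patterns can be reproduced in isolation. Below is a minimal sketch (the helper names `extract_strict` and `extract_relaxed` are hypothetical; the real evaluation script loops over choices differently) showing that the strict pattern, which requires a newline after the choice letter, fails to capture `(A) A fullerene`, while the relaxed pattern captures it:

```python
def extract_strict(model_answer: str, choices: str = "ABCDE") -> str:
    """Original pattern: a newline must follow the choice letter."""
    for c in choices:
        if (model_answer.endswith(f" {c}.")
                or model_answer.endswith(f" ({c}).")
                or model_answer.startswith(f"{c}\n")
                or model_answer.startswith(f"({c})\n")
                or model_answer.startswith(f"({c}) {c}\n")):
            return c
    return model_answer  # no match: answer passes through unchanged


def extract_relaxed(model_answer: str, choices: str = "ABCDE") -> str:
    """Relaxed pattern: no newline required after the choice letter."""
    for c in choices:
        if (model_answer.endswith(f" {c}.")
                or model_answer.endswith(f" ({c}).")
                or model_answer.startswith(f"{c}")
                or model_answer.startswith(f"({c})")):
            return c
    return model_answer


print(extract_strict("(A) A fullerene"))   # -> "(A) A fullerene" (not captured)
print(extract_relaxed("(A) A fullerene"))  # -> "A"
print(extract_strict("(A)\nA fullerene"))  # -> "A" (newline form is captured)
```

Note that the relaxed `startswith(f"{c}")` check introduces its own risk: a free-form answer like "Because ..." would be mis-extracted as `B`, which may be why the original pattern anchored on a newline.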

Question: The pattern seems misleading when extracting a multiple-choice answer. Why should we expect a newline between the choice letter and its content, or at the end?

Thanks for any help!

mathvision-cuhk commented 4 months ago

Hello, thank you for your attention.

Different prompts may influence the output of the model and the corresponding answer extraction. However, the variation in the results in our article is not as significant as you reported: the gap is within 2%. Since the evaluation code is the same for all models, the assessment remains fair.

Regarding the extraction issue, MATH-Vision has now been integrated into open-compass/VLMEvalKit, which uses language models to extract answers, thereby avoiding the aforementioned problems. I hope this information is helpful to you.