Closed: huiyeruzhou closed this issue 4 months ago
Hello, thank you for your attention.
Different prompts may influence the model's output and the corresponding answer extraction. However, the variation in our article's results is not as significant as what you reported: the gap is within 2%. Since the same evaluation code is used for every model, the assessment remains fair.
Regarding the extraction issue, MATH-Vision has now been integrated into open-compass/VLMEvalKit, which uses language models to extract answers and thereby avoids the problems described above. I hope this information is helpful to you.
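For illustration, LLM-based extraction amounts to prompting a model to return only the choice letter, which sidesteps brittle regexes entirely. Below is a minimal sketch of the idea, assuming an OpenAI-style client as the backend; the prompt wording, model name, and `extract_choice` helper are my own choices for this sketch, not VLMEvalKit's actual implementation.

```python
# Minimal sketch of LLM-based answer extraction (an illustration, not
# VLMEvalKit's actual code). Assumes the OpenAI Python SDK as one possible
# backend, with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def extract_choice(model_output: str) -> str:
    """Ask an LLM to pull the choice letter out of a free-form answer."""
    prompt = (
        "Below is a model's answer to a multiple-choice question. "
        "Reply with only the choice letter (A, B, C, D, or E).\n\n"
        f"Answer: {model_output}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper()

# extract_choice("(A) something") should return "A" whether or not the
# answer ends with a newline.
```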
Hi! I've run into a problem while checking the evaluation policy. Generating the answer without a trailing newline, such as

```
(A) something
```

is a common pattern; however, it will be interpreted as `asomething` by the evaluation code (for an answer like `(C) 60`, the extraction is expected to be `c` instead of `c60`).

Experiment: in lines 26~27 the extraction pattern expects a `\n`. By changing it (removing the `\n` from the pattern), I got a doubled pass rate on llava-7b (7.24% -> 16.12%).
Question: the pattern seems misleading when we try to extract a multiple-choice answer. Why should we expect a newline between the choice letter and its content, or at the end?
Thanks for any help!