Open kkk123698745 opened 1 week ago
Hi @kkk123698745, I will rerun the evaluation and post the results here (to confirm: are you using gpt-4o-20240806?). By the way, we are happy to update the evaluation result if you can create a PR to integrate your better evaluation setting.
@kennymckormick Thanks for your reply; looking forward to your rerun results. Yes, we used gpt-4o-20240806. Indeed, we just ran your code and got a score of 61.6% without any modification to the prompt, and a score of 65.3% after appending '\nsolution:' to the tail of the prompt. The full prompt looks like this: 'Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end. Question: Is the number of purple metallic things that are behind the small green motorbike less than the number of blue metal articulated buss? Choices: (A) Yes (B) No solution:'
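For clarity, the prompt variant described above can be sketched as follows. This is a minimal illustrative helper, not the actual VLMEvalKit code; `build_prompt` and its parameters are hypothetical names, and the only assumption is that the evaluation prompt is the hint, question, and lettered choices concatenated, with '\nsolution:' optionally appended at the end.

```python
# Hypothetical sketch of the prompt modification discussed above.
# build_prompt is an illustrative helper, not VLMEvalKit's real API.

HINT = ("Hint: Please answer the question and provide the correct "
        "option letter, e.g., A, B, C, D, at the end.")

def build_prompt(question: str, choices: list[str],
                 add_solution_suffix: bool = True) -> str:
    """Assemble a multiple-choice prompt; optionally append the
    '\nsolution:' suffix that changed the MathVista score in our runs."""
    # Label choices (A), (B), (C), ... using ASCII offsets from 'A'.
    choice_str = " ".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    prompt = f"{HINT} Question: {question} Choices: {choice_str}"
    if add_solution_suffix:
        prompt += "\nsolution:"
    return prompt
```

With `add_solution_suffix=False` this reproduces the unmodified 61.6% setting; with the default `True` it reproduces the 65.3% variant.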
Hi, @kkk123698745 , I have performed two checks:
Screenshot: (image attached)
Screenshot: (image attached)
I think the most likely reason for the difference is that, since GPT-4o is a proprietary model (a black box to us), OpenAI may have modified the model behind the API, so the results are now different. For API models, our evaluation results only reflect the performance at the corresponding timestamp.
After reproducing gpt-4o on MathVista, we found its result is 61.6, which is lower than the 62.7 on the list. Is there something I missed in reproducing that score? We have also tested gpt-4o with the prompt ending in 'solution:' and got a score of 65.3%.