lupantech / MathVista

MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts
https://mathvista.github.io/
Creative Commons Attribution Share Alike 4.0 International

Gemini and GPT4V output file #23

Closed · ggg0919 closed this issue 6 months ago

ggg0919 commented 6 months ago

Thanks for your great work! I am currently comparing the performance of various models on testmini. I found the LLaVA output file output_llava_llama_2_13b.json under MathVista/results/llava/, but the output files for Gemini and GPT-4V are missing. Could you provide the output files for GPT-4V and Gemini? Thank you!
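
For reference, I am inspecting the released output files roughly as in the sketch below; the field names "query" and "response" are my guess at the JSON layout, not a documented schema:

```python
# Quick look at a released output file; field names are assumed, not documented.
import json

with open("results/llava/output_llava_llama_2_13b.json") as f:
    outputs = json.load(f)

# Print a few entries to see the query/response format used for testmini.
for pid, entry in list(outputs.items())[:3]:
    print(pid, "|", entry.get("query", "")[:60], "->", entry.get("response", ""))
```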

lupantech commented 6 months ago

Thank you so much for your kind words. Currently, we have decided not to release the responses from advanced models such as Gemini and GPT-4V in order to minimize data contamination of the benchmark.

ggg0919 commented 6 months ago

> Thank you so much for your kind words. Currently, we have decided not to release the responses from advanced models such as Gemini and GPT-4V in order to minimize data contamination of the benchmark.

Thank you for your reply. I understand your decision not to release the responses of the advanced models. May I ask whether they used any special question prompts? I used the prompts under data/query.json, but Gemini Pro 1.0 only achieved 37.4 on testmini, far below the 45.2 on the leaderboard. What might be causing this gap? I roughly followed the setup sketched below. Looking forward to your answer, thank you!
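
My evaluation loop is roughly the following sketch; the local paths, the metadata file with image locations, and the gemini-pro-vision model name are specific to my setup and are only meant as an illustration, not the repo's official script:

```python
# Rough sketch of querying Gemini Pro 1.0 with the prompts from data/query.json.
# Paths and the metadata file are assumptions about a local setup.
import json
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro-vision")  # Gemini Pro 1.0 multimodal model

with open("data/query.json") as f:
    queries = json.load(f)          # problem id -> query string

with open("data/testmini_metadata.json") as f:   # hypothetical file with local image paths
    metadata = json.load(f)

responses = {}
for pid, query in queries.items():
    image = Image.open(metadata[pid]["image"])
    result = model.generate_content([query, image])
    responses[pid] = result.text

with open("output_gemini_pro.json", "w") as f:
    json.dump(responses, f, indent=2)
```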

lupantech commented 6 months ago

The prompts under data/query.json are a custom-designed version we provide for convenient evaluation; they are optional and may not be optimized. The discrepancy in your case might be due to Google using an optimized prompt template when generating responses (see Section 8.6.12 of the technical report for the prompt template that Gemini used). Alternatively, the responses in your evaluation could be incomplete due to rate limiting. For more details, please reach out to Google's Gemini team.
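
If rate limiting is the issue, one simple mitigation is to retry failed or empty calls with backoff and re-run any remaining problems afterwards. A minimal sketch (the helper name and the broad exception handling are only illustrative and should be narrowed to the actual rate-limit error your client raises):

```python
# Minimal retry-with-backoff wrapper around generate_content; illustrative only.
import time

def generate_with_retries(model, parts, max_retries=5, base_delay=2.0):
    for attempt in range(max_retries):
        try:
            text = model.generate_content(parts).text
            if text:                      # treat empty responses as failures too
                return text
        except Exception:                 # e.g. quota / rate-limit errors
            pass
        time.sleep(base_delay * (2 ** attempt))
    return ""                             # give up; re-run these problem ids later
```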