Why is there a significant difference between the scores obtained for the llama-3.2-11b-instruct model here and the scores reported on the official Hugging Face model card? For example, on the AI2D benchmark the officially reported score is 91.1 (https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct), but with this code I only obtain around 75.

Hi @uyzhang. Thank you very much for pointing this out. Due to time constraints, we simply added the model and did not further adjust the hyperparameters or the system prompt to match the official evaluation details. We will align the results when we have time; thanks for the reminder.
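For reference, below is a minimal sketch (not this repo's actual evaluation code) of how Llama-3.2-11B-Vision-Instruct is typically queried with Hugging Face `transformers`. The question wording, image path, `max_new_tokens`, and decoding settings shown here are assumptions for illustration; the official AI2D number depends on the exact prompt template and generation hyperparameters, so a mismatch in any of these can plausibly explain a gap of this size.

```python
# Hedged sketch: querying Llama-3.2-11B-Vision-Instruct on one AI2D-style
# multiple-choice question. Prompt wording, image path, and generation
# settings are illustrative assumptions, not the official evaluation setup.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Hypothetical multiple-choice prompt; the exact system prompt and answer
# format used for the officially reported 91.1 would need to be matched.
question = (
    "Which label points to the stem?\n"
    "A. 1\nB. 2\nC. 3\nD. 4\n"
    "Answer with the option letter only."
)
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("ai2d_example.png")  # placeholder image path
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Greedy decoding with a small generation budget; different settings here
# (sampling, longer outputs, different answer extraction) shift the score.
output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
answer = processor.decode(
    output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(answer)
```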