open-compass / VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks
https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
Apache License 2.0

Fix mathvista Idefics2 #393

Closed · HugoLaurencon closed this 1 month ago

HugoLaurencon commented 1 month ago

A very small change to the prompting for MathVista. It can change performance slightly (by up to 1 point).

Idefics2 was fine-tuned with a specific prompt for MCQs. In this PR, I add a sentence that was always seen during fine-tuning when the model is expected to answer an MCQ with a letter.
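Concretely, the change is along these lines (a sketch; the helper structure and the exact wording of the appended sentence are illustrative, not the exact VLMEvalKit code):

```python
# Sketch of the prompt tweak (illustrative; not the exact VLMEvalKit code).
def build_mathvista_prompt(question: str, choices: dict | None = None) -> str:
    prompt = question
    if choices:  # multiple-choice question, e.g. {'A': '4', 'B': '5'}
        prompt += '\n' + '\n'.join(f'{key}. {text}' for key, text in choices.items())
        # The sentence Idefics2 always saw during fine-tuning when it was
        # expected to answer an MCQ with a letter (wording assumed here).
        prompt += '\nAnswer with the letter.'
    return prompt

print(build_mathvista_prompt('What is 2 + 3?', {'A': '4', 'B': '5'}))
```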

Feel free to directly merge if you think this modification makes sense.

kennymckormick commented 1 month ago

Thanks, we will re-evaluate and update the results of Idefics2 on MathVista.

kennymckormick commented 1 month ago

BTW, a community contributor is also trying to add support for Idefics3. Do you have time to take a look (at things like evalset-specific prompts)?

https://github.com/open-compass/VLMEvalKit/pull/379
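For reference, evalset-specific prompts in VLMEvalKit typically hang off two hooks on the model wrapper. A minimal sketch of what the Idefics3 side could look like (the message structure and field names follow the common pattern, but the details here are illustrative):

```python
class Idefics3Wrapper:
    """Sketch of VLMEvalKit-style hooks for evalset-specific prompts."""

    def use_custom_prompt(self, dataset: str) -> bool:
        # Opt in to custom prompt building only for these benchmarks.
        return dataset in ('MMMU_DEV_VAL', 'MathVista_MINI', 'MMStar')

    def build_prompt(self, line: dict, dataset=None) -> list:
        prompt = line['question']
        if dataset == 'MathVista_MINI' and line.get('choices'):
            # Same MCQ instruction as in the Idefics2 sketch above.
            prompt += '\nAnswer with the letter.'
        # VLMEvalKit messages are interleaved image/text segments.
        return [
            dict(type='image', value=line['image_path']),
            dict(type='text', value=prompt),
        ]
```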

HugoLaurencon commented 1 month ago

Actually, I haven't tested on Idefics2, only on Idefics2-large, which we are going to release soon (maybe this week).

I think there's not much to change. The only thing is that the PR in Transformers is not merged yet. Apart from that, the custom prompts for MMMU and MathVista remain valid, and the one for MMStar looks good too.

There are some (hopefully small) discrepancies between generating with our internal repo and with the Transformers integration. If the scores differ too much from what we have reported, don't hesitate to ping me so I can take a look!

kennymckormick commented 1 month ago

@HugoLaurencon Unfortunately, I find this modification does not work for Idefics2-8B. Its original score on MathVista was 52.2; after the update, it drops to 51.4. You can double-check by running `torchrun --nproc-per-node=$GPU run.py --model idefics2_8b --data MathVista_MINI` with VLMEvalKit on your side.

HugoLaurencon commented 1 month ago

Okay thanks for the evaluation!

Maybe it's because the Idefics2 integration was recently broken in recent versions of Transformers; could you tell me which version you used?

I will try to investigate a bit more.

kennymckormick commented 1 month ago


The results were obtained with transformers==4.44.0 and torch==2.0.1+cu118.
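For anyone reproducing this, the environment versions can be printed with a short snippet:

```python
import torch
import transformers

# Print the library versions of the evaluation environment.
print('transformers', transformers.__version__)
print('torch', torch.__version__)
```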

HugoLaurencon commented 1 month ago

Thanks, I'll have a look when I find time! Also, if you still have the per-category MMMU evaluation scores for Idefics2 in your cache, would it be possible to copy-paste the whole VLMEvalKit output here, so I can compare with what I get using slightly different prompts? If it's not in your cache, no worries, no need to spend time recomputing; I'll do it!

kennymckormick commented 1 month ago

Hi @HugoLaurencon, we have created a Hugging Face dataset named OpenVLMRecords: https://huggingface.co/datasets/VLMEval/OpenVLMRecords. You can find the records in that repo.
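Pulling the records locally should be straightforward with huggingface_hub (a sketch assuming the records are stored as plain files in the dataset repo):

```python
from huggingface_hub import snapshot_download

# Download the whole records repo (note repo_type='dataset') to a local dir.
local_dir = snapshot_download(
    repo_id='VLMEval/OpenVLMRecords',
    repo_type='dataset',
)
print('Records downloaded to:', local_dir)
```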

HugoLaurencon commented 1 month ago

Very nice feature!