open-compass / VLMEvalKit

An open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks
https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
Apache License 2.0

Judge model seems not working #236

Closed · ruifengma closed this 4 months ago

ruifengma commented 4 months ago

I have been running CogVLM2 on TextVQA and got an accuracy of 4.06, while InternVL and other models like MiniCPM got roughly 70 or 80. I compared the model outputs against the ground-truth answers and found that most of them are actually correct, but they are wrapped in so much extraneous description that they cannot easily be matched against the short reference answers. I then used my local Qwen model as the judge model, but the accuracy stayed the same, and the evaluation took roughly an hour longer on the A40 than before. How can I check whether the judge model is working correctly?

kennymckormick commented 4 months ago

Hi @ruifengma, the use of a judge model is not officially supported for datasets like TextVQA. If you want to use an LLM to help evaluate VQA problems, you need to implement your own logic in evaluation scripts like vqa_eval.py.
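
For concreteness, here is a rough sketch of the kind of logic you could add: ask a judge LLM whether the verbose model answer matches the short ground-truth answer. The endpoint, model name, prompt wording, and helpers (`call_judge`, `llm_judge_match`) are all placeholders, not anything in this repo:

```python
# Hypothetical sketch, not VLMEvalKit code: ask a judge LLM whether a
# verbose VQA answer matches the short ground-truth answer.
from openai import OpenAI

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Ground-truth answer: {gt}\n"
    "Model answer: {pred}\n"
    "Does the model answer express the same meaning as the ground truth? "
    "Reply with exactly 'yes' or 'no'."
)

# Assumes a local judge model (e.g. Qwen) served behind an OpenAI-compatible
# API; adjust the base_url and model name to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def call_judge(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="qwen",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def llm_judge_match(question: str, gt: str, pred: str) -> bool:
    # Cheap substring check first, so the judge is only called on answers
    # that do not trivially match the reference.
    if gt.strip().lower() in pred.strip().lower():
        return True
    reply = call_judge(JUDGE_PROMPT.format(question=question, gt=gt, pred=pred))
    return reply.strip().lower().startswith("yes")
```

The substring fast path also limits the extra evaluation time you observed, since most trivially-correct answers never reach the judge model.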

ruifengma commented 4 months ago

Thanks @kennymckormick, now I see. By the way, where can I find the list of datasets that support a judge model?

kennymckormick commented 4 months ago

Hi @ruifengma, the following kinds of datasets adopt LLMs as choice extractors or judges:

  1. Multi-choice or Yes-or-No benchmarks: the LLM serves as a choice extractor (see the sketch after this list)
  2. Benchmarks that adopt LLMs as judges: MathVista, MMVet, LLaVABench
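
For intuition, choice extraction boils down to asking the LLM to map a free-form reply onto one of the option letters. A rough sketch, reusing the hypothetical `call_judge` helper from the earlier sketch (the prompt and regex fallback are illustrative, not the toolkit's actual implementation):

```python
import re

# Illustrative sketch of LLM-based choice extraction for multi-choice
# benchmarks; prompt and fallback are assumptions, not VLMEvalKit code.
EXTRACT_PROMPT = (
    "Below is a free-form reply to a multiple-choice question with options "
    "{options}. Respond with only the letter of the option the reply chose.\n"
    "Reply: {reply}"
)

def extract_choice(reply: str, options: str = "A, B, C, D") -> str | None:
    # Fast path: the reply already leads with a bare option letter,
    # optionally parenthesized, e.g. "A", "(B)", "C.".
    m = re.match(r"\s*\(?([A-D])\)?\b", reply)
    if m:
        return m.group(1)
    # Otherwise ask the extractor LLM to map the reply onto an option.
    letter = call_judge(EXTRACT_PROMPT.format(options=options, reply=reply)).strip()
    return letter if letter in {"A", "B", "C", "D"} else None
```

The regex fast path handles replies that already start with a bare option letter, so the extractor LLM is only consulted for genuinely free-form replies.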