Closed — ruifengma closed this issue 4 months ago
Hi @ruifengma, the use of a judge model is not yet officially supported for datasets like TextVQA. If you want to use LLMs to help evaluate VQA problems, you need to implement your own logic in evaluation scripts such as vqa_eval.py.
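To illustrate what such custom logic might look like, here is a minimal, hypothetical sketch of an LLM-as-judge hook for VQA scoring. None of the function names below come from VLMEvalKit; `call_judge` is a placeholder for a real request to your judge model (e.g. a local Qwen endpoint) and is stubbed here with naive substring matching so the sketch is self-contained:

```python
# Hypothetical LLM-as-judge hook for VQA evaluation (not VLMEvalKit API).
# call_judge() is a stub: replace it with an actual request to your
# judge model. The stub marks a prediction correct if the ground-truth
# answer appears verbatim inside it, just so the sketch runs end to end.

def build_judge_prompt(question: str, gt: str, prediction: str) -> str:
    """Assemble a yes/no judging prompt for the LLM."""
    return (
        "Question: {q}\n"
        "Ground-truth answer: {g}\n"
        "Model prediction: {p}\n"
        "Does the prediction contain the correct answer? Reply Yes or No."
    ).format(q=question, g=gt, p=prediction)

def call_judge(prompt: str) -> str:
    # Placeholder for the real LLM call. Parses the prompt back out and
    # answers "Yes" iff the ground truth is a substring of the prediction.
    fields = dict(line.split(": ", 1) for line in prompt.splitlines()[:3])
    gt = fields["Ground-truth answer"].lower()
    pred = fields["Model prediction"].lower()
    return "Yes" if gt in pred else "No"

def judge_vqa(question: str, gt: str, prediction: str) -> float:
    """Return 1.0 if the judge accepts the prediction, else 0.0."""
    verdict = call_judge(build_judge_prompt(question, gt, prediction))
    return 1.0 if verdict.strip().lower().startswith("yes") else 0.0
```

A hook like this would replace the exact-match comparison for verbose model outputs; logging each prompt/verdict pair is also an easy way to confirm the judge is actually being invoked.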
Thanks @kennymckormick, now I see. Btw, where can I find the list of datasets that support a judge model?
Hi @ruifengma, the following datasets adopt LLMs as choice extractors or judges:
I have been running CogVLM2 on TextVQA and got an accuracy of 4.06, while InternVL and other models like MiniCPM score roughly 70-80. I checked the model output against the ground-truth answers and found that most answers are actually correct, but they contain too much useless description, so they cannot easily be matched against the short reference answer. I then used my local Qwen model as the judge model, but the accuracy stayed the same, and the evaluation took roughly one hour longer on the A40 than before. How can I check whether the judge model is working correctly?