open-compass / VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks
https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
Apache License 2.0

[Benchmark] support TableVQABench #401

Closed hkunzhe closed 3 weeks ago

hkunzhe commented 1 month ago

Code: https://github.com/naver-ai/tablevqabench
Dataset: https://huggingface.co/datasets/terryoo/TableVQA-Bench

FangXinyu-0913 commented 1 month ago

Hi, @hkunzhe. We tested TableVQABench on llava-v1.5-13b, and the results reveal a gap compared to the accuracy reported in the paper. I'd like to ask about the versions of the various repositories at the time of the test, and whether you used any special prompts.

[Screenshots: llava-v1.5-13b TableVQABench results vs. the accuracy reported in the paper]

hkunzhe commented 4 weeks ago

@FangXinyu-0913 Thanks for taking the time to test this PR! First, I should clarify that I am just a contributor, not an author of the paper. TableVQABench does use custom prompts, as shown in https://github.com/open-compass/VLMEvalKit/pull/401/files#diff-a804fe08cc046ef30889e3fe6eb84a3d864cd7d30e72a7428e37a9865e24c208R14-R53 and https://github.com/open-compass/VLMEvalKit/pull/401/files#diff-de0832032bcfd53bcba87a73fe82cbf929c6e6c5d2c660f735fd60c76cc3d889R456-R468. I also noticed that the model's build_prompt takes precedence over the dataset's. Should I append the prompt template to the question field of the TSV file, or is there another solution?
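For reference, a minimal sketch of the "bake the prompt into the TSV" option. The file names, column name, and template text below are placeholders for illustration, not the actual TableVQABench prompt or data layout:

```python
# Sketch only: prepend a fixed prompt template to every row's question column,
# so models that ignore the dataset's build_prompt still see the full prompt.
import pandas as pd

# Hypothetical template; the real TableVQABench prompts live in the PR diff linked above.
PROMPT_TEMPLATE = (
    "You are given a table image. Answer the question based on the table.\n"
    "Question: {question}\nAnswer:"
)

df = pd.read_csv("TableVQABench.tsv", sep="\t")  # hypothetical TSV path
df["question"] = df["question"].apply(
    lambda q: PROMPT_TEMPLATE.format(question=q)  # bake the template into each row
)
df.to_csv("TableVQABench_prompted.tsv", sep="\t", index=False)
```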

FangXinyu-0913 commented 4 weeks ago

@hkunzhe Thank you for your contribution to the VLM community! As you said, the model's build_prompt currently takes precedence over the dataset's. To handle these benchmarks uniformly, we plan to create a new class for them in which the dataset's build_prompt is used directly, without the model needing to process the prompt further. I'll be working on it and expect to have a commit and re-testing done by Tuesday.
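A rough illustration of that idea (class name, method signature, and message format here are assumptions for the sketch, not the actual VLMEvalKit API): the dataset owns build_prompt, so every model receives the same fully formed prompt for benchmarks with fixed custom prompts.

```python
# Illustrative sketch of a dataset-owned prompt builder; not the real implementation.
class CustomPromptDataset:
    """Benchmark whose prompt template is fixed by the dataset, not the model."""

    # Hypothetical template; the real one would come from the benchmark definition.
    PROMPT_TEMPLATE = "{question}\nAnswer the question based on the table image."

    def __init__(self, data):
        # data: list of dicts with 'image_path' and 'question' keys (assumed schema)
        self.data = data

    def build_prompt(self, idx):
        item = self.data[idx]
        question = self.PROMPT_TEMPLATE.format(question=item["question"])
        # Return interleaved image/text messages; models consume this as-is
        # instead of applying their own prompt template.
        return [
            dict(type="image", value=item["image_path"]),
            dict(type="text", value=question),
        ]
```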

hkunzhe commented 4 weeks ago

@FangXinyu-0913 That sounds reasonable. I'll stay tuned.

FangXinyu-0913 commented 4 weeks ago

@hkunzhe As for why the llava-v1.5-13b values differ from those in the paper, we found that it is likely an issue on our side (it also happens on other datasets), since the values we measured for mplug-owl2 are close to those in the paper (see the figure below). TableVQABench's custom prompts were used. For the models tested in the paper, we have verified that they all prioritize TableVQABench's custom prompts during testing, so we will not change the overall architecture for now. We'd like to hear any further comments from you; if you have none, we plan to merge this PR early next week.

[Screenshot: mplug-owl2 TableVQABench results, close to the paper's reported numbers]

hkunzhe commented 3 weeks ago

@FangXinyu-0913 I have no further comments. Feel free to merge it :)