microsoft / promptbench

A unified evaluation framework for large language models
http://aka.ms/promptbench
MIT License
2.45k stars 182 forks

Access to per-sample evaluation results #64

Closed adhirajghosh closed 4 months ago

adhirajghosh commented 6 months ago

Hi, thanks for the great work! For my current project, I would like to use the sample-wise evaluation results of the VLM experiments you have conducted.

If you could provide the sample-wise evaluation logs on the multimodal datasets mentioned (VQAv2, NoCaps, MMMU, MathVista, AI2D, ChartQA, ScienceQA) for the models evaluated (BLIP2, LLaVA, Qwen-VL, Qwen-VL-Chat, InternLM-XComposer2-VL, GPT-4V, Gemini Pro Vision, Qwen-VL-Max, Qwen-VL-Plus), I would greatly appreciate it. If I have missed a dataset or model, please feel free to include it as well.

MingxuanXia commented 6 months ago

Hi, I'm sorry, but we cannot provide the sample-wise evaluation logs.

github-actions[bot] commented 4 months ago

Stale issue message