stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Add Prometheus Vision Automatic Evaluation for VLM Evaluation #2678

Closed YiyangZhou closed 4 months ago

YiyangZhou commented 4 months ago

Hi, in this PR, I added Prometheus Vision automatic evaluation. I did a canary run (10 instances) on the Bingo scenario using Qwen-VL-Chat. Here are the resulting output files: run_spec.json, per_instance_stats.json, scenario.json
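
For anyone reproducing the canary run, the invocation might look roughly like the following. This is only a sketch: it assumes the standard helm-run flags, that the Bingo run entry shown further down in this comment is saved as run_entries.conf, and an illustrative suite name.

    # Canary run: limit evaluation to 10 instances for a quick check.
    helm-run --conf-paths run_entries.conf --suite bingo-canary --max-eval-instances 10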

Here's a screenshot after running ./pre-commit.sh

[screenshot of ./pre-commit.sh output]

And for running the full Bingo scenario, the conf is like:

    entries: [
      {description: "bingo:subject=Region,model=qwen/qwen-vl-chat,num_respondents=1", priority: 1}
    ]

The credentials.conf file is like:

    critiqueModelName: huggingface/prometheus-vision-13b-v1.0-hf
    critiqueType: model
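
With the two configs above in place, the full run and result inspection might look roughly like this. Again only a sketch: it assumes the standard helm-run, helm-summarize, and helm-server entry points, that credentials.conf sits in the default prod_env/ directory, and illustrative file and suite names.

    # Full Bingo run; the Prometheus-Vision critique model is picked up
    # from credentials.conf via critiqueModelName / critiqueType.
    helm-run --conf-paths run_entries.conf --suite bingo-full

    # Aggregate per-instance stats and browse them in the local web UI.
    helm-summarize --suite bingo-full
    helm-server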

Please let me know how I can improve it, thanks!

teetone commented 4 months ago

Closing in favor of https://github.com/stanford-crfm/helm/pull/2691