Holistic Evaluation of Language Models (HELM) is a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
Hi, in this PR I added Prometheus Vision automatic evaluation. I did a canary run (10 instances) on the Bingo scenario using Qwen-VL-Chat. Here are the results:
`run_spec.json`, `per_instance_stats.json`, `scenario.json`
Here's a screenshot after running `./pre-commit.sh`:
And to run the full Bingo scenario, the conf entry is:
entries: [ {description: "bingo:subject=Region,model=qwen/qwen-vl-chat,num_respondents=1", priority: 1} ]
The credentials.conf file is like:
critiqueModelName: huggingface/prometheus-vision-13b-v1.0-hf
critiqueType: model
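For reference, with the conf entry and credentials above in place, the canary run can be reproduced with an invocation along these lines (a sketch, not the exact command I ran: the `run_entries.conf` filename and the `--suite` name are assumptions, and flag names may differ across HELM versions):

```shell
# Hypothetical sketch: assumes the entry above is saved in run_entries.conf
# and credentials.conf holds the critique settings shown above.
helm-run \
  --conf-paths run_entries.conf \
  --suite prometheus-vision-canary \
  --max-eval-instances 10
```

Dropping `--max-eval-instances 10` (or raising it) would cover the full scenario instead of the 10-instance canary.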
Please let me know how I can improve it, thanks!