stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Add Prometheus Vision Automatic Evaluation for VLM Evaluation #2678

Closed YiyangZhou closed 4 months ago

YiyangZhou commented 4 months ago

Hi, in this PR, I added Prometheus Vision automatic evaluation. I did a canary run (10 instances) on the Bingo scenario using Qwen-VL-Chat. Here are the resulting output files: run_spec.json, per_instance_stats.json, scenario.json
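
For anyone reproducing the canary run, the invocation might look roughly like the following. This is only a sketch: it assumes the standard helm-run flags, that the Bingo run entry shown further down in this comment is saved as run_entries.conf, and an illustrative suite name.

    # Canary run: limit evaluation to 10 instances for a quick check.
    helm-run --conf-paths run_entries.conf --suite bingo-canary --max-eval-instances 10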

Here's a screenshot after running ./pre-commit.sh

[screenshot of ./pre-commit.sh output]

And for running the full Bingo scenario, the conf is like:

    entries: [
      {description: "bingo:subject=Region,model=qwen/qwen-vl-chat,num_respondents=1", priority: 1}
    ]

The credentials.conf file is like:

    critiqueModelName: huggingface/prometheus-vision-13b-v1.0-hf
    critiqueType: model
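
With the two configs above in place, the full run and result inspection might look roughly like this. Again only a sketch: it assumes the standard helm-run, helm-summarize, and helm-server entry points, that credentials.conf sits in the default prod_env/ directory, and illustrative file and suite names.

    # Full Bingo run; the Prometheus-Vision critique model is picked up
    # from credentials.conf via critiqueModelName / critiqueType.
    helm-run --conf-paths run_entries.conf --suite bingo-full

    # Aggregate per-instance stats and browse them in the local web UI.
    helm-summarize --suite bingo-full
    helm-server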

Please let me know how I can improve it, thanks!

teetone commented 4 months ago

Closing in favor of https://github.com/stanford-crfm/helm/pull/2691