stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Fix Open-ended VLM Generation Metrics #2691

Closed. ImKeTT closed this 4 months ago.

ImKeTT commented 4 months ago
  1. I corrected max_tokens in the Unicorn scenario to 1; the results look fine now.
  2. I modified the input prompts for three open-ended VLM generation tasks (Crossmodal-3600, Bingo, Flickr30k) so that VLMs generate answers better aligned with the references. I have tested 50 instances on each of these scenarios (model=openai/gpt-4-vision-preview with _get_vibe_eval_critique_metric_specs()), and the results look good.
  3. I changed the metrics of these three scenarios to _get_vibe_eval_critique_metric_specs(), which can also easily be swapped for _get_prometheus_vision_critique_metric_specs(), as they do almost the same job; see the sketch after this list.
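For reference, this is roughly the shape of the run spec changes. The function name, scenario class path, and adapter settings below are my own illustrative assumptions, not the exact code in this PR:

# Illustrative sketch only: get_bingo_spec and the scenario class_name path
# are assumed for the example; only the metric_specs swap reflects this PR.
from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.run_spec import RunSpec
from helm.benchmark.scenarios.scenario import ScenarioSpec

def get_bingo_spec() -> RunSpec:  # hypothetical name for one open-ended scenario
    scenario_spec = ScenarioSpec(
        class_name="helm.benchmark.scenarios.vision_language.bingo_scenario.BingoScenario",  # assumed path
    )
    adapter_spec = AdapterSpec(
        method="generation",
        input_prefix="",
        output_prefix="",
        num_outputs=1,
        max_tokens=100,  # open-ended generation; the Unicorn fix instead caps this at 1
    )
    return RunSpec(
        name="bingo",
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=_get_vibe_eval_critique_metric_specs(),  # the metric change described above
        groups=["bingo"],
    )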

@teetone, would you take a look and let me know how I can improve it? Thanks!

teetone commented 4 months ago

@ImKeTT Could we make Prometheus the default for now, i.e. use _get_prometheus_vision_critique_metric_specs()?

ImKeTT commented 4 months ago

I changed the metrics for all four open-ended VLM generation tasks (Bingo, Flickr30k, Crossmodal-3600, Vibe-Eval) to _get_prometheus_vision_critique_metric_specs(). The credentials.conf looks like:

critiqueModelName: huggingface/prometheus-vision-13b-v1.0-hf
critiqueType: model

I set max_tokens for Prometheus-Vision to 200; otherwise the rating might get truncated (a sketch of where this cap fits is below). I've tested 100 instances each for Bingo and Vibe-Eval, and the results look good. Take a look when you have time, thanks @teetone!
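Roughly, the helper could look like the following; the import, the metric class_name path, and the argument names are assumptions for illustration, not copied from the HELM codebase:

# Schematic only: module paths and argument names below are assumed.
from helm.benchmark.metrics.metric import MetricSpec

def _get_prometheus_vision_critique_metric_specs(num_respondents: int = 1, max_tokens: int = 200):
    return [
        MetricSpec(
            class_name="helm.benchmark.metrics.vision_language.prometheus_vision_critique_metrics.PrometheusVisionCritiqueMetric",  # assumed path
            args={
                "num_respondents": num_respondents,
                "max_tokens": max_tokens,  # cap the judge's generation so the rating is not cut off
            },
        )
    ]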