stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Fix Open-ended VLM Generation Metrics #2691

Closed. ImKeTT closed this 4 months ago.

ImKeTT commented 4 months ago
  1. I corrected max_tokens in the Unicorn scenario to 1; the results look fine now.
  2. I modified the input prompts for three open-ended VLM generation tasks (Crossmodal-3600, Bingo, Flickr30k) so that VLMs generate answers better aligned with the references. I have tested 50 instances on each of these scenarios (model=openai/gpt-4-vision-preview with _get_vibe_eval_critique_metric_specs()), and the results look good.
  3. I changed the metrics of these three scenarios to _get_vibe_eval_critique_metric_specs(), which can also easily be swapped for _get_prometheus_vision_critique_metric_specs(), as they do almost the same job; see the sketch after this list.
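For reference, this is roughly the shape of the run spec changes. The function name, scenario class path, and adapter settings below are my own illustrative assumptions, not the exact code in this PR:

# Illustrative sketch only: get_bingo_spec and the scenario class_name path
# are assumed for the example; only the metric_specs swap reflects this PR.
from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.run_spec import RunSpec
from helm.benchmark.scenarios.scenario import ScenarioSpec

def get_bingo_spec() -> RunSpec:  # hypothetical name for one open-ended scenario
    scenario_spec = ScenarioSpec(
        class_name="helm.benchmark.scenarios.vision_language.bingo_scenario.BingoScenario",  # assumed path
    )
    adapter_spec = AdapterSpec(
        method="generation",
        input_prefix="",
        output_prefix="",
        num_outputs=1,
        max_tokens=100,  # open-ended generation; the Unicorn fix instead caps this at 1
    )
    return RunSpec(
        name="bingo",
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=_get_vibe_eval_critique_metric_specs(),  # the metric change described above
        groups=["bingo"],
    )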

@teetone, would you take a look and let me know how I can improve it? Thanks!

teetone commented 4 months ago

@ImKeTT Could we make Prometheus the default for now, i.e. use _get_prometheus_vision_critique_metric_specs()?

ImKeTT commented 4 months ago

I changed the metrics for all four open-ended VLM generation tasks (Bingo, Flickr30k, Crossmodal-3600, Vibe-Eval) to _get_prometheus_vision_critique_metric_specs(). The credentials.conf looks like:

critiqueModelName: huggingface/prometheus-vision-13b-v1.0-hf
critiqueType: model

I set max_tokens for Prometheus-Vision to 200; otherwise the rating might get truncated (a sketch of where this cap fits is below). I've tested 100 instances each for Bingo and Vibe-Eval, and the results look good. Take a look when you have time, thanks @teetone!
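Roughly, the helper could look like the following; the import, the metric class_name path, and the argument names are assumptions for illustration, not copied from the HELM codebase:

# Schematic only: module paths and argument names below are assumed.
from helm.benchmark.metrics.metric import MetricSpec

def _get_prometheus_vision_critique_metric_specs(num_respondents: int = 1, max_tokens: int = 200):
    return [
        MetricSpec(
            class_name="helm.benchmark.metrics.vision_language.prometheus_vision_critique_metrics.PrometheusVisionCritiqueMetric",  # assumed path
            args={
                "num_respondents": num_respondents,
                "max_tokens": max_tokens,  # cap the judge's generation so the rating is not cut off
            },
        )
    ]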