stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Add RekaClient, the Vibe-Eval (Scenario and Auto-Evaluator) #2675

Closed ImKeTT closed 4 months ago

ImKeTT commented 4 months ago

Hi! In this PR, I added `RekaClient` and the Vibe-Eval scenario and auto-evaluator. I did a canary run (50 instances) on the Vibe-Eval scenario using Qwen-VL-Chat. Here are the results:

- run_spec.json
- scenario.json
- per_instance_stats.json

To run the full Vibe-Eval scenario, the run entries conf looks like this:

entries: [
    {description: "vibe_eval:subject=difficulty-normal,model=vlm,num_respondents=1", priority: 1}
    {description: "vibe_eval:subject=difficulty-hard,model=vlm,num_respondents=1", priority: 1}
]
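For reference, here is a sketch of how the conf above could be written out and passed to `helm-run`. The filename and suite name are hypothetical, and the `helm-run` invocation (commented out) assumes HELM is installed and a valid credentials.conf with the Reka API key is in place:

```shell
# Write the run entries to a conf file (hypothetical filename)
cat > vibe_eval_run_entries.conf <<'EOF'
entries: [
    {description: "vibe_eval:subject=difficulty-normal,model=vlm,num_respondents=1", priority: 1}
    {description: "vibe_eval:subject=difficulty-hard,model=vlm,num_respondents=1", priority: 1}
]
EOF

# Then run the scenario (requires HELM installed and credentials configured);
# --max-eval-instances 50 reproduces the canary run described above:
# helm-run --conf-paths vibe_eval_run_entries.conf --suite vibe-eval --max-eval-instances 50
```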

The credentials.conf file looks like this:

rekaApiKey: your-reka-api-key
critiqueModelName: reka/reka-core-20240415
critiqueType: model

Please let me know how I can improve it. Thanks!

ImKeTT commented 4 months ago

Thanks for the review @teetone! I think this PR is ready.