stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Add Model as Judge for medical scenarios #2727

Closed farzaank closed 3 months ago

farzaank commented 3 months ago

This updates the open-ended medical scenarios (LiveQA and MedicationQA) to use a model as judge.

This implementation uses 4 score buckets for LiveQA, as the human judges did in the LiveQA paper, and 3 buckets for MedicationQA, for which there is no recommended approach.
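For illustration, here is a minimal sketch of how bucketed model-as-judge scoring could work. It is not the code in this PR: the names `NUM_BUCKETS`, `build_judge_prompt`, `parse_bucket`, and `judge_score` are hypothetical, the prompt wording is made up, and rescaling the bucket to [0, 1] is an assumption rather than something the PR states. The judge is passed in as a plain callable so the sketch does not depend on any particular model client.

```python
import re
from typing import Callable, Optional

# Bucket counts from the PR description; the dictionary keys are illustrative names.
NUM_BUCKETS = {"live_qa": 4, "medication_qa": 3}


def build_judge_prompt(scenario: str, question: str, reference: str, answer: str) -> str:
    """Build a (hypothetical) grading prompt asking the judge for a single bucket number."""
    n = NUM_BUCKETS[scenario]
    return (
        f"Rate the candidate answer on a scale from 1 (worst) to {n} (best).\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Reply with the number only."
    )


def parse_bucket(reply: str, num_buckets: int) -> Optional[int]:
    """Return the first integer in the reply if it is a valid bucket, else None."""
    match = re.search(r"\b(\d+)\b", reply)
    if match is None:
        return None
    bucket = int(match.group(1))
    return bucket if 1 <= bucket <= num_buckets else None


def judge_score(
    scenario: str,
    question: str,
    reference: str,
    answer: str,
    judge: Callable[[str], str],
) -> Optional[float]:
    """Ask the judge for a bucket and rescale it to [0, 1] (rescaling is an assumption)."""
    n = NUM_BUCKETS[scenario]
    reply = judge(build_judge_prompt(scenario, question, reference, answer))
    bucket = parse_bucket(reply, n)
    if bucket is None:
        return None  # Unparseable or out-of-range reply; the caller decides how to handle it.
    return (bucket - 1) / (n - 1)
```

Returning `None` for replies that cannot be parsed keeps judge failures separate from genuinely low scores, so they can be counted or retried instead of silently scored as the lowest bucket.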