Holistic Evaluation of Language Models (HELM) is a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). The same framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
Once @yifanmai's PR https://github.com/stanford-crfm/helm/pull/1323 is merged, we can support and evaluate https://huggingface.co/stanford-crfm/BioMedLM on the biomedical tasks here: https://github.com/stanford-crfm/helm/pull/1332.