Holistic Evaluation of Language Models (HELM) is a framework for increasing the transparency of language models (https://arxiv.org/abs/2211.09110). The framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
Besides the introduction of instruction-following scenarios (which is expected), there are some other differences between the light_scenarios exported by the new export_scenario_text.py script and our old version.
The differences include:
ICE is missing
MultiLexSum is missing
NewsQA is missing
Efficiency and robustness scenarios are not removed.
See the diff below: