Holistic Evaluation of Language Models (HELM) is a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). The same framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
Addresses a failure mode of MATH on Claude 3.5 Sonnet: the model does not wrap its final answer in LaTeX \boxed{} formatting, which leads to a very low accuracy score. This pull request changes the output_format_instructions run expander to add instructions telling the model to box its final answer.
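To illustrate the failure mode, here is a minimal, standalone sketch of why an unboxed answer can score as wrong even when the number is right. It is not HELM's actual code: `extract_boxed_answer`, `add_boxed_instructions`, `is_correct`, and the instruction wording in `BOXED_INSTRUCTIONS` are all hypothetical names chosen for this example, and the scorer is assumed to compare only the contents of the last `\boxed{...}` against the reference.

```python
from typing import Optional

# Hypothetical instruction text; the wording actually added by this PR may differ.
BOXED_INSTRUCTIONS = "Wrap the final answer in \\boxed{}."


def extract_boxed_answer(text: str) -> Optional[str]:
    r"""Return the contents of the last \boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # no boxed answer at all
    begin = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # braces never closed


def add_boxed_instructions(prompt: str, instructions: str = BOXED_INSTRUCTIONS) -> str:
    """Append the formatting instructions to a prompt (illustrative only)."""
    return f"{prompt.rstrip()}\n\n{instructions}"


def is_correct(completion: str, reference: str) -> bool:
    """Assumed scoring rule: exact match on the extracted boxed answer."""
    return extract_boxed_answer(completion) == reference
```

Under these assumptions, a completion such as `The answer is 42.` is scored as incorrect against the reference `42`, while `The answer is \boxed{42}.` is scored as correct, which is why adding the formatting instruction to the prompt matters.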