Holistic Evaluation of Language Models (HELM) is a framework for increasing the transparency of language models (https://arxiv.org/abs/2211.09110). The same framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
If the generated output is empty, calling GPT-4o with that empty output inserted into the prompt template often causes GPT-4o to hallucinate a fictional model output and then judge it. This can result in false positives or incorrectly formatted annotations.
Instead, if the generated output is empty, we set prompt_text to the empty string, reasoning to BLOCKED_REQUEST_OR_EMPTY_RESPONSE and score to 0.0.
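
The empty-output rule can be sketched as follows. This is a minimal illustration, not the actual HELM annotator code: the function name annotate_output, the {{ANSWER}} placeholder, the call_judge callable, and the choice to treat whitespace-only output as empty are all assumptions made for the example. Only the three field names and the values they receive come from the description above.

```python
from typing import Any, Callable, Dict

# Sentinel stored in "reasoning" when the judge is skipped (assumed to be a plain string).
BLOCKED_REQUEST_OR_EMPTY_RESPONSE = "BLOCKED_REQUEST_OR_EMPTY_RESPONSE"


def annotate_output(
    model_output: str,
    prompt_template: str,
    call_judge: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Annotate one model output, skipping the judge call when the output is empty.

    call_judge is a hypothetical stand-in for the GPT-4o request; it takes the
    filled-in prompt and returns a dict with "reasoning" and "score" keys.
    """
    # Assumption: whitespace-only output counts as empty.
    if not model_output.strip():
        return {
            "prompt_text": "",
            "reasoning": BLOCKED_REQUEST_OR_EMPTY_RESPONSE,
            "score": 0.0,
        }

    # Hypothetical placeholder syntax; the real template may use a different marker.
    prompt_text = prompt_template.replace("{{ANSWER}}", model_output)
    result = call_judge(prompt_text)
    return {
        "prompt_text": prompt_text,
        "reasoning": result["reasoning"],
        "score": float(result["score"]),
    }
```

Returning early keeps the judge from ever seeing an empty answer, so there is nothing for it to hallucinate, and the sentinel in reasoning lets downstream analysis separate blocked or empty responses from outputs that were genuinely scored as 0.0.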