open-compass / CriticEval

[NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
https://arxiv.org/abs/2402.13764
Apache License 2.0

[Bug] Subjective evaluation score is parsed incorrectly #3

Open WencWu opened 1 day ago

WencWu commented 1 day ago

For example, in the sample below, **Score: 9** is incorrectly parsed as 5.0:

{
    "question": "To cook perfectly golden pancakes,",
    "obj": {
        "generation_a": "Mix the ingredients together in a bowl and pour it onto a hot pan.",
        "generation_b": "To cook golden pancakes, mix flour, eggs, milk, and baking powder. Heat a non-stick pan, scoop batter, and cook until bubbles form. Flip and cook until golden. Serve with maple syrup.🥞",
        "preference": "ansB",
    },
    "sub": {
        "generation_a": "To cook golden pancakes, mix flour, eggs, milk, and baking powder. Heat a non-stick pan, scoop batter, and cook until bubbles form. Flip and cook until golden. Serve with maple syrup.🥞",
        "generation_b": "Mix the ingredients together in a bowl and pour it onto a hot pan.",
        "feedback": "Answer A provides a more detailed, step-by-step process of how to cook pancakes. It specifies the ingredients needed, the type of pan to use, and how to know when the pancake is ready to flip. It also suggests a way to serve the pancakes. Answer B, on the other hand, is more general and lacks these details. \n\nDecision: A.",
        "preference": "ansA",
        "llm_a": "openbuddy-70b",
        "llm_b": "internlm-7b",
    },
    "data_source": "piqa",
    "metadata": {"quality_a_sub": "high", "quality_b_sub": "low"},
    "evaluation": {
        "cot": "### Evaluation of Generated Feedback\n\n**Generated Feedback Analysis:**\n\n1. **Clarity and Detail:**\n   - The generated feedback accurately highlights that Answer A provides a clear and detailed step-by-step guide, mentioning specific ingredients and cooking steps. This aligns well with the reference feedback, which also emphasizes the detailed nature of Answer A.\n\n2. **Additional Information:**\n   - The generated feedback notes the suggestion of serving with maple syrup, which is a practical addition. This aspect is not explicitly mentioned in the reference feedback, making the generated feedback slightly more comprehensive in this regard.\n\n3. **Engagement:**\n   - The generated feedback mentions the use of an emoji in Answer A, adding an element of engagement. This is an additional observation not covered in the reference feedback, which could be seen as enhancing the evaluation by considering the engagement factor.\n\n4. **Critique of Answer B:**\n   - Both the generated and reference feedbacks agree that Answer B lacks detail and completeness. The generated feedback expands on this by pointing out the absence of engaging elements, which is a valid critique but not present in the reference feedback.\n\n5. **Overall Preference:**\n   - The generated feedback clearly states a preference for Answer A, providing reasons that are consistent with the reference feedback. It also adds value by discussing engagement and additional serving suggestions.\n\n**Comparison with Reference Feedback:**\n\n- The generated feedback provides a more comprehensive analysis by including aspects of engagement and additional serving suggestions, which are not present in the reference feedback. 
This makes the generated feedback slightly more detailed and engaging.\n\n**Score: 9**\n\nThe generated feedback is slightly better than the reference feedback due to its additional insights into engagement and serving suggestions, while still maintaining a clear and accurate analysis of the answers.",
        "score": 5.0,
    },
}
WencWu commented 18 hours ago

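The "Decision: .+" pattern has the same problem: if the output contains "**Decision: 6**", it cannot be parsed correctly either. Would it be better to change the regular expression to "Decision: [\.\d]+"?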
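
A minimal sketch of the difference between the two patterns, assuming the captured text is passed to `float()`. The names `GREEDY`, `numeric_pattern`, and `parse_score` are hypothetical and not taken from the CriticEval source. The thread does not show how the project ends up with 5.0, so a failed parse is represented here by `None`.

```python
import re
from typing import Optional

# Hypothetical stand-in for the current pattern quoted in this issue.
GREEDY = re.compile(r"Decision: (.+)")


def numeric_pattern(keyword: str) -> re.Pattern:
    """Build the proposed pattern: capture only digits and dots after the keyword."""
    return re.compile(rf"{re.escape(keyword)}: ([\.\d]+)")


def parse_score(text: str, pattern: re.Pattern) -> Optional[float]:
    """Return the last captured value as a float, or None if parsing fails."""
    matches = pattern.findall(text)
    if not matches:
        return None
    try:
        return float(matches[-1])
    except ValueError:
        # E.g. "6**" (trailing Markdown bold markers) is not a valid float.
        return None


sample = "Some analysis.\n\n**Decision: 6**\n\nClosing remark."

# ".+" runs to the end of the line, so the capture is "6**" and float() fails.
greedy_result = parse_score(sample, GREEDY)

# "[\.\d]+" stops at the first non-numeric character, so the capture is "6".
numeric_result = parse_score(sample, numeric_pattern("Decision"))

# The same construction applies to the "**Score: 9**" case from the first comment.
score_result = parse_score("**Score: 9**\n\nSummary.", numeric_pattern("Score"))
```

Note that `[\.\d]+` would also capture strings such as `1.2.3` or a lone `.`, so the `float()` call still needs to be guarded, as it is in the sketch.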