复现对不齐问题 - Githubissues

作者，您好！非常感谢您的工作。我最近在研究大模型评估方向，尝试使用 GPT4 和 JudgeLM 的 prompt 在 LLMBar 上进行评估，prompt 模板如下所示，与 LLMBar 的主要区别在于让大模型为两个 answer 打分而不是直接回答谁更好。

You are a helpful and precise assistant for checking the quality of the answer. [Question] {question} [The Start of Assistant 1's Answer] {answer_1} [The End of Assistant 1's Answer] [The Start of Assistant 2's Answer] {answer_2} [The End of Assistant 2's Answer] [System] We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above. Please rate the helpfulness, relevance, accuracy, level of details of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance. Please first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space. In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment.

然而，测试结果与论文中 GPT4/Vanilla_NoRules 的结果相比差距很大，除 Natural（下降 8%）和 GPTOut（上升 4%）外，其余三个测试集上的平均准确率均下降约 20%-30%。这个测试脚本在以前使用本地模型在其他数据集上测试的时候没有问题。此外，我还用 LLMBar 中的 LLMEvaluator/evaluate.py 测试了一下，除了在 openai_completion() 函数中使用了自己的模型调用函数之外没有任何改动，也就是说采用的依然是 LLMBar 的 prompt，却发现 Neighbor 和 Manual 上的平均准确率分别为 53.4% 和 68.5%，与原论文中的 64.2% 和 75.0% 也有较大差距，另几个数据集上的测试结果与论文中的基本一致。尝试了 GPT4 的不同版本，如 gpt-4-0125-preview 和 gpt-4-0613，差距基本没变。上述过程没有采用任何其他提示策略。请问作者觉得可能是什么原因导致了上面的两种情况？

princeton-nlp / LLMBar

复现对不齐问题 #2