Closed liuyaox closed 2 weeks ago
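Thank you for your interest in MT-Bench-101!
Regarding your two questions:
In the origin_prompt of both cases, the HUMAN message contains a {prediction} that was not filled in with the model's actual answer. Is this normal?
This is not normal; every answer should be filled in.
I pulled the latest OpenCompass code and tested it, and found no anomaly. My test setup: set both models and judge_models to the simplest GPT-4 API, pick two samples from the raw data for a quick test, and run python run.py configs/eval_subjective_mtbench101.py --debug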
models = [dict(
    abbr='GPT4-Turbo',
    type=OpenAI,
    path='gpt-4o-2024-05-13',  # To compare with the official leaderboard, please use gpt-4-1106-preview
    key='',  # The key will be obtained from $OPENAI_API_KEY, but you can write down your key here as well
    meta_template=api_meta_template,
    query_per_second=16,
    max_out_len=4096,
    max_seq_len=4096,
    batch_size=1,
    temperature=0.8,
)]
infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=10000),
    runner=dict(
        type=LocalRunner,
        partition='llm_dev2',
        quotatype='auto',
        max_num_workers=32,
        task=dict(type=OpenICLInferTask),
    ),
)
judge_models = [dict(
    abbr='GPT4-Turbo',
    type=OpenAI,
    path='gpt-4o-2024-05-13',  # To compare with the official leaderboard, please use gpt-4-1106-preview
    key='',  # The key will be obtained from $OPENAI_API_KEY, but you can write down your key here as well
    meta_template=api_meta_template,
    query_per_second=16,
    max_out_len=4096,
    max_seq_len=4096,
    batch_size=8,
    temperature=0.8,
)]
The results are:
{
"0": {
"origin_prompt": [
{
"role": "SYSTEM",
"prompt": "Please act as an impartial judge follow this instructions: xxx"
},
{
"role": "HUMAN",
"prompt": "The dialogue need to be judged is: \n *** \n \n\n Human: Now there are three people A, B and C. I currently know that A is taller than B and B is taller than C. Who is the tallest currently?\n\nAssistant: Given that A is taller than B and B is taller than C, we can infer the following order of height:\n\n- A is taller than B.\n- B is taller than C.\n\nFrom this information, it is clear that A is the tallest among the three. \n ***\n\n The reference solution is: \n ### \n [{'role': 'user', 'content': 'Now there are three people A, B and C. I currently know that A is taller than B and B is taller than C. Who is the tallest currently?'}, {'role': 'assistant', 'content': 'Based on the given information, A is the tallest among the three people.'}] \n ###\n\n"
}
],
"prediction": "The assistant's response is evaluated according to the given criteria:\n\n1. **Accuracy of Answer**: The assistant correctly identifies A as the tallest based on the provided information, which matches the reference solution. \n\n2. **Completeness and Clarity of Reasoning**: The assistant provides a step-by-step explanation, clearly stating that A is taller than B and B is taller than C, leading to the conclusion that A is the tallest. \n\n3. **Integration of Historical Dialogue**: The assistant effectively uses the historical dialogue to derive the correct answer.\n\n4. **Communication and Instruction**: The assistant communicates the solution in a clear and instructive manner, making it easy for the human to understand and follow the logic.\n\nOverall, the assistant's response is accurate, logically structured, and well-communicated.\n\nRating: [[10]]",
"gold": {
"multi_id": 1,
"task": "GR",
"turn_id": "1"
}
},
"1": {
"origin_prompt": [
{
"role": "SYSTEM",
"prompt": "Please act as an impartial judge follow this instructions: xxx"
},
{
"role": "HUMAN",
"prompt": "The dialogue need to be judged is: \n *** \n \n\n Human: Now there are three people A, B and C. I currently know that A is taller than B and B is taller than C. Who is the tallest currently?\n\nAssistant: Based on the given information, A is the tallest among the three people.\n\n Human: Now there are two more people, D and E. D is higher than B and E is higher than D. Who is the tallest at the moment?\n\nAssistant: Given the updated information:\n- A is taller than B.\n- B is taller than C.\n- D is taller than B.\n- E is taller than D.\n\nTo determine the tallest among A, B, C, D, and E, we need to compare A and E, since they are the tallest in their respective sequences.\n\nHowever, we don't have any direct information comparing A and E. Based on the given information alone, we cannot definitively say who is the tallest between A and E.\n\nIf no additional information is provided, we can only conclude that either A or E is the tallest. \n ***\n\n The reference solution is: \n ### \n [{'role': 'user', 'content': 'Now there are three people A, B and C. I currently know that A is taller than B and B is taller than C. Who is the tallest currently?'}, {'role': 'assistant', 'content': 'Based on the given information, A is the tallest among the three people.'}, {'role': 'user', 'content': 'Now there are two more people, D and E. D is higher than B and E is higher than D. Who is the tallest at the moment?'}, {'role': 'assistant', 'content': 'Based on the additional information, it is not possible to determine who the tallest person is at the moment. The given information only states the relative heights between the individuals but does not provide any specific comparison between A, C, D, and E.'}] \n ###\n\n"
}
],
"prediction": "The assistant's response correctly identifies that there is insufficient information to determine definitively who the tallest person is between A and E. It provides a logical and clear explanation for this conclusion by comparing A and E and recognizing that the relative heights between the sequences cannot be determined with the given information.\n\n### Evaluation Criteria:\n\n1. **Accuracy of the Answer:** The assistant's answer aligns with the reference solution in accurately stating that it is impossible to determine the tallest person definitively.\n2. **Reasoning Process:** The assistant provides a sound reasoning process by comparing the heights of individuals within both sequences and acknowledging the lack of direct information between A and E.\n3. **Integration of Historical Dialogue:** The assistant effectively integrates the historical dialogue and considers all the previous information provided.\n4. **Communication:** The explanation is clear, logical, and understandable, which aids in the human’s comprehension of the situation.\n\nGiven the assistant's thoroughness in addressing the problem and the clarity of its reasoning, it meets the criteria very well.\n\nRating: [[10]]",
"gold": {
"multi_id": 1,
"task": "GR",
"turn_id": "2"
}
},
"2": {
"origin_prompt": [
{
"role": "SYSTEM",
"prompt": "Please act as an impartial judge follow this instructions: xxx"
},
{
"role": "HUMAN",
"prompt": "The dialogue need to be judged is: \n *** \n \n\n Human: Now there are three people A, B and C. I currently know that A is taller than B and B is taller than C. Who is the tallest currently?\n\nAssistant: Based on the given information, A is the tallest among the three people.\n\n Human: Now there are two more people, D and E. D is higher than B and E is higher than D. Who is the tallest at the moment?\n\nAssistant: Based on the additional information, it is not possible to determine who the tallest person is at the moment. The given information only states the relative heights between the individuals but does not provide any specific comparison between A, C, D, and E.\n\n Human: Now, I know that D is higher than A. Who is the highest now?\n\nAssistant: Given the new piece of information that D is taller than A, and previously knowing that E is taller than D, E is the tallest among all the individuals (A, B, C, D, and E). \n ***\n\n The reference solution is: \n ### \n [{'role': 'user', 'content': 'Now there are three people A, B and C. I currently know that A is taller than B and B is taller than C. Who is the tallest currently?'}, {'role': 'assistant', 'content': 'Based on the given information, A is the tallest among the three people.'}, {'role': 'user', 'content': 'Now there are two more people, D and E. D is higher than B and E is higher than D. Who is the tallest at the moment?'}, {'role': 'assistant', 'content': 'Based on the additional information, it is not possible to determine who the tallest person is at the moment. The given information only states the relative heights between the individuals but does not provide any specific comparison between A, C, D, and E.'}, {'role': 'user', 'content': 'Now, I know that D is higher than A. Who is the highest now?'}, {'role': 'assistant', 'content': 'Based on the updated information, if D is higher than A and A is already known to be taller than B and B is taller than C, E is higher than D. 
then E would be the tallest person at the moment.'}] \n ###\n\n"
}
],
"prediction": "The AI assistant's response accurately identifies that E is the tallest person based on the newly provided information that D is taller than A, and E is taller than D. This matches the reference solution.\n\n1. **Accuracy**: The response correctly concludes that E is the tallest. This is consistent with the reference solution.\n2. **Reasoning Process**: The AI assistant provides a clear and logical reasoning process:\n - It restates the new information that D is taller than A.\n - It uses the previous information that E is taller than D to conclude that E is the tallest.\n3. **Integration of Historical Dialogue**: The AI assistant integrates the historical information correctly, considering all given height comparisons.\n4. **Communication**: The explanation is clear and understandable, effectively communicating the solution.\n\nOverall, the assistant's response meets all the evaluation criteria effectively.\n\nRating: [[10]]",
"gold": {
"multi_id": 1,
"task": "GR",
"turn_id": "3"
}
},
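The HUMAN prompts are filled in correctly, and the judge predictions are normal.
The first case is scored normally, while the second case says the prediction was not filled in and therefore cannot be scored. Why is there a difference?
In both the first and the second case, the HUMAN message contains a {prediction} that was not filled in, so both cases are abnormal. The only difference is that in the first case the fill failed after a few turns, whereas in the second case it failed in the very first turn, which makes GPT-4's refusal behavior differ slightly. This is not the main issue, though; you should first resolve the first problem, namely why the placeholder is not filled.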
Conclusion
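Since my test with the latest code shows no anomaly, could you check again whether something was overlooked in your usage? Also, your model's answers are in Japanese, which seems a bit odd; could special characters have been introduced that cause the fill to fail?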
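One way to narrow this down is to scan the results file for judge prompts that still contain the literal placeholder. The following is a minimal sketch, assuming the results have the shape shown in the JSON above (a top-level object keyed by case index, each with an origin_prompt list of role/prompt entries); the function name and the inline sample are hypothetical and not part of OpenCompass.

```python
def find_unfilled_cases(results: dict, placeholder: str = "{prediction}") -> list:
    """Return the keys of cases whose HUMAN prompt still contains the placeholder."""
    unfilled = []
    for case_id, case in results.items():
        for message in case.get("origin_prompt", []):
            is_human = message.get("role") == "HUMAN"
            if is_human and placeholder in message.get("prompt", ""):
                unfilled.append(case_id)
                break  # one hit per case is enough
    return unfilled


# Hypothetical, shortened sample in the same shape as the results file.
sample = {
    "0": {"origin_prompt": [
        {"role": "SYSTEM", "prompt": "Please act as an impartial judge ..."},
        {"role": "HUMAN", "prompt": "Human: Who is the tallest?\n\nAssistant: A is the tallest."},
    ]},
    "1": {"origin_prompt": [
        {"role": "SYSTEM", "prompt": "Please act as an impartial judge ..."},
        {"role": "HUMAN", "prompt": "Human: Who is the tallest?\n\nAssistant: {prediction}"},
    ]},
}

unfilled_ids = find_unfilled_cases(sample)
print(unfilled_ids)
```

For a real run, the dictionary would come from json.load on the file in the results directory; any case listed by the helper was sent to the judge without the model's answer.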
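Hello, thanks for the reply!
format(history=history, prediction='{prediction}')
history is indeed filled in, but prediction is filled with {prediction}, which amounts to not filling it at all. So there must be a second fill later on. Where does this second fill happen? I searched for a long time and could not find it. Many thanks.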
Please note that here we only construct the dataset class, which is used later, just as OpenCompass does for MTBench (https://github.com/open-compass/opencompass/blob/842fb1cd7018151e234e471946b62aa7a2e4d26e/opencompass/datasets/subjective/mtbench.py#L106 and https://github.com/open-compass/opencompass/blob/842fb1cd7018151e234e471946b62aa7a2e4d26e/opencompass/datasets/subjective/mtbench.py#L177). You can check the code in the inference and evaluation phases, or ask the OpenCompass maintainers.
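The deferred fill described here can be illustrated in plain Python. This is only a sketch of the general two-stage pattern, assuming the judge prompt is an ordinary string template; the template text, function names, and the use of str.replace for the second stage are assumptions for illustration and do not reproduce the OpenCompass implementation.

```python
# Hypothetical judge template with two placeholders.
JUDGE_TEMPLATE = "The dialogue need to be judged is: \n *** \n{history}{prediction} \n ***"


def build_dataset_prompt(history: str) -> str:
    # Stage 1 (dataset construction): fill the history now and re-emit the
    # prediction placeholder literally so a later stage can still find it.
    return JUDGE_TEMPLATE.format(history=history, prediction="{prediction}")


def fill_prediction(prompt: str, prediction: str) -> str:
    # Stage 2 (evaluation): substitute the model's answer. str.replace is used
    # here so that braces inside the dialogue history cannot break formatting.
    return prompt.replace("{prediction}", prediction)


stage1 = build_dataset_prompt("Human: Who is the tallest?\n\nAssistant: ")
stage2 = fill_prediction(stage1, "A is the tallest.")
print(stage1)
print(stage2)
```

After stage 1 the prompt still contains the literal {prediction}, which is expected; the placeholder should only be a concern if it is still present in the prompt that actually reaches the judge model.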
Thank you.
Prerequisite
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
Reproduces the problem - code/configuration sample
Reproduces the problem - command or script
Reproduces the problem - error message
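This is the JSON file in the results directory after evaluation. Note that there are 2 cases, and the prediction of the second case says
The assistant's response is marked as \"{prediction}\" which indicates that there is no actual response provided in the dialogue to evaluate
which means that in the HUMAN message of origin_prompt, the {prediction} was not filled in with the model's answer. Strangely, the prediction of the first case scores normally and gives the highest score of 10. I have 2 questions: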
Other information
Original issue: https://github.com/open-compass/opencompass/issues/1271
No response