modelscope / evalscope

A streamlined and customizable framework for efficient large model evaluation and performance benchmarking
Apache License 2.0

Incorrect results in baseline model comparison mode #118

Open stay-leave opened 2 weeks ago

stay-leave commented 2 weeks ago

I followed https://evalscope.readthedocs.io/zh-cn/latest/user_guides/arena.html#id8. Since there was no baseline, I first ran predictions with two models and got two prediction jsonl files. I then used one as the baseline and the other as target_answers, and after evaluating with OpenAI I got registry/data/arena/reviews/review_gpt4_pair_baseline.jsonl, shown below. This should be a tie, but model_a wins every time. Is the expected format of baseline_file the same as the output of a single model prediction run?

{"model_a": "2.2", "model_b": "self", "win_1": "model_a", "win_2": "model_a", "anony": true, "tstamp": 1724924814.5014713, "language": "NA", "question_id": 1, "category": "generic", "question": "How can I improve my time management skills?", "review_text_1": "Both assistants provide relevant and helpful advice on improving time management skills. However, Assistant A provides a more detailed and structured response, with each point elaborated upon. Assistant A also includes additional tips such as eliminating distractions, practicing saying no, and learning to delegate, which are not mentioned by Assistant B. Furthermore, Assistant A ends with a note of encouragement, emphasizing that improving time management skills takes practice and patience. Therefore, Assistant A's response is more comprehensive and helpful. \n\nFinal Verdict: [[A]]", "review_text_2": "Both assistants provide relevant and helpful advice on improving time management skills. They both suggest prioritizing tasks, creating a schedule, breaking down large tasks, and taking breaks. However, Assistant B provides a more detailed and structured response, including additional tips such as eliminating distractions, learning to say no, and delegating tasks. Assistant B also emphasizes the importance of patience and practice in improving time management skills. Therefore, Assistant B's response is more comprehensive and detailed. \n\nFinal Verdict: [[B]]"} {"model_a": "2.2", "model_b": "self", "win_1": "model_a", "win_2": "model_a", "anony": true, "tstamp": 1724924826.6124618, "language": "NA", "question_id": 2, "category": "generic", "question": "What are the most effective ways to deal with stress?", "review_text_1": "Both assistants provide relevant and accurate responses to the user's question about dealing with stress. However, Assistant A's response is more detailed and comprehensive, offering a wider range of strategies and explaining how each one can help manage stress. Assistant A also mentions the importance of trial and error in finding what works best for the individual, and suggests seeking professional help if needed. While Assistant B's response is accurate and relevant, it lacks the depth and detail of Assistant A's response. Therefore, Assistant A provides a better answer to the user's question. \n\nFinal Verdict: [[A]]", "review_text_2": "Both assistants provided relevant and helpful responses to the user's question about dealing with stress. They both mentioned similar strategies such as exercise, relaxation techniques, social support, self-care, and mindfulness. However, Assistant B provided a more detailed and organized response, breaking down the strategies into numbered points and elaborating on each one. Assistant B also mentioned cognitive-behavioral therapy, which Assistant A did not, and concluded by reminding the user that everyone deals with stress differently and that it's okay to seek professional help if needed. Therefore, Assistant B's response is more comprehensive and detailed. \n\nFinal Verdict: [[B]]"}

wangxingjun778 commented 2 weeks ago

Thanks, we will try to reproduce this on our side as well.

As for "model_a wins every time", there are a few possible explanations: one is that model_a's answers are simply better, and the questions happen to favor model_a; another is GPT-4 judge bias, i.e. the judge can be swayed by the length, phrasing style, and similar surface features of the answers under comparison, which skews the result.
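
If the length-bias hypothesis is the one worth checking first, here is a rough sketch of a diagnostic that pairs the two prediction files by question and counts how often the baseline wrote the longer answer; the file names and the "question_id"/"answer" field names are assumptions about the prediction-file schema, not evalscope's verified output format:

import json

def load_answers(path):
    # Hypothetical schema: one JSON object per line with question_id and answer.
    with open(path, encoding="utf-8") as f:
        return {r["question_id"]: r["answer"] for r in map(json.loads, f)}

baseline = load_answers("answers_baseline.jsonl")   # hypothetical file names
target = load_answers("answers_target.jsonl")

paired = [q for q in baseline if q in target]
longer = sum(len(baseline[q]) > len(target[q]) for q in paired)
print(f"baseline answer is longer on {longer}/{len(paired)} questions")

If that fraction closely tracks the win rate in the review file, a length preference in the judge is a plausible part of the explanation.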

stay-leave commented 1 week ago

What I don't understand is the first question's "win_1": "model_a", "win_2": "model_a" — the two review texts clearly give verdicts of A and B respectively, so how do both wins come out as model_a? On top of that, the subsequent metric computation fails with:

Traceback (most recent call last):
  File "/root/lhd/eval/evalscope/evalscope/run_arena.py", line 206, in <module>
    main()
  File "/root/lhd/eval/evalscope/evalscope/run_arena.py", line 202, in main
    arena_workflow.run(dry_run=args.dry_run)
  File "/root/lhd/eval/evalscope/evalscope/run_arena.py", line 187, in run
    self.get_rating_results()
  File "/root/lhd/eval/evalscope/evalscope/run_arena.py", line 169, in get_rating_results
    res_list = ae.run(self.review_file)
  File "/root/lhd/py_env/swift/lib/python3.10/site-packages/evalscope/evaluator/rating_eval.py", line 176, in run
    res_list = self.eval_samples([data_df])
  File "/root/lhd/py_env/swift/lib/python3.10/site-packages/evalscope/evaluator/rating_eval.py", line 157, in eval_samples
    res = self.compute_pairwise_rating(raw_data)
  File "/root/lhd/py_env/swift/lib/python3.10/site-packages/evalscope/evaluator/rating_eval.py", line 124, in compute_pairwise_rating
    df = df.groupby(['model']).sum()
  File "/root/lhd/py_env/swift/lib/python3.10/site-packages/pandas/core/frame.py", line 9183, in groupby
    return DataFrameGroupBy(
  File "/root/lhd/py_env/swift/lib/python3.10/site-packages/pandas/core/groupby/groupby.py", line 1329, in __init__
    grouper, exclusions, obj = get_grouper(
  File "/root/lhd/py_env/swift/lib/python3.10/site-packages/pandas/core/groupby/grouper.py", line 1043, in get_grouper
    raise KeyError(gpr)
KeyError: 'model'

The other modes, pairwise comparison and standalone scoring, run normally without errors.
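
On the traceback itself: the final KeyError only says that the frame being grouped inside compute_pairwise_rating has no 'model' column at that point. A minimal pandas-only reproduction of that last step (the row contents are hypothetical, loosely modeled on the review records above, and say nothing about what rating_eval.py actually builds internally):

import pandas as pd

rows = [
    {"win_1": "model_a", "win_2": "model_a", "question_id": 1},
    {"win_1": "model_a", "win_2": "model_a", "question_id": 2},
]
df = pd.DataFrame(rows)
df.groupby(["model"]).sum()   # raises KeyError: 'model'

So the baseline-pair review file is presumably missing, or naming differently, whatever field the rating step turns into a 'model' column, which circles back to the question of the expected baseline_file/review format.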