lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

Evidence of bias in the "fun" evaluation method using GPT-4 scores? #826


thandal commented 1 year ago

As described in the main post, the evaluation method is presented as a cool but informal idea rather than a rigorous approach. Even so, it is used to produce some pretty compelling plots showing vicuna's performance relative to other models.

I wanted to present some simple evidence of bias that I've found in the evaluation method. Please don't take this as a criticism -- more just some interesting observations!

The first is an "ordering" bias, and the second is a "close-to-home" bias. Both are based on very little data, so they may themselves be subject to criticism.

Method

First, I wrapped the evaluation code in the repository to aggregate all the scores into the following form (see aggregate_scores_from_table.py; a rough sketch of the idea follows the example output below):

{
  "COMPARISON MODEL": {
    "total_score1": 576.0,  <-- for comparison model
    "total_score2": 696.5,  <-- for base model (vicuna)
    "failed_eval_count": 0, <-- didn't get a score
    "better1": 4,  <-- comparison model had higher score
    "better2": 76, <-- base model had higher score
    "tie": 0 <-- tied scores
  },
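
For concreteness, here is a minimal sketch of the aggregation idea. It is not the actual aggregate_scores_from_table.py; the review-file layout and the "score" field name are assumptions.

# Minimal aggregation sketch (illustrative only, not the actual
# aggregate_scores_from_table.py). Assumes one JSONL review file per
# comparison model, where each line carries a "score" field holding
# [score1, score2] -- score1 for the comparison model, score2 for the
# base model (vicuna). Field names and file layout are assumptions.
import json

def aggregate(review_files):
    results = {}
    for model, path in review_files.items():
        agg = {"total_score1": 0.0, "total_score2": 0.0,
               "failed_eval_count": 0, "better1": 0, "better2": 0, "tie": 0}
        with open(path) as f:
            for line in f:
                scores = json.loads(line).get("score")
                if not scores or len(scores) != 2:
                    agg["failed_eval_count"] += 1  # didn't get a score
                    continue
                s1, s2 = scores
                agg["total_score1"] += s1
                agg["total_score2"] += s2
                if s1 > s2:
                    agg["better1"] += 1
                elif s2 > s1:
                    agg["better2"] += 1
                else:
                    agg["tie"] += 1
        results[model] = agg
    return results

print(json.dumps(aggregate({"alpaca-13b": "review_alpaca-13b.jsonl"}), indent=2))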

Second, I used gpt-3.5-turbo to re-run the evaluations already present in the repository, plus one more evaluation: vicuna-13b against itself (i.e., the same answers supplied as both model1 and model2). Note that in my re-runs, I went through and manually assigned the scores when they were present somewhere in the text rather than in the expected format.
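
A re-run like this only needs one judge call per question pair. Below is a minimal standalone sketch of that call using the openai Python client; the prompt wording and the score parsing are my own approximations, not FastChat's actual evaluation code.

# Minimal judge-call sketch (an approximation, not FastChat's evaluation code).
# Asks gpt-3.5-turbo to score two answers and parses the first two numbers in
# the reply; the prompt text and the regex-based parsing are assumptions.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question, answer1, answer2, model="gpt-3.5-turbo"):
    prompt = (
        f"Question: {question}\n\n"
        f"Assistant 1: {answer1}\n\n"
        f"Assistant 2: {answer2}\n\n"
        "Rate each assistant on a scale of 1 to 10. "
        "Output the two scores on the first line, separated by a space, "
        "then a short explanation."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    text = resp.choices[0].message.content
    nums = re.findall(r"\d+(?:\.\d+)?", text)
    if len(nums) < 2:
        return None  # failed eval; these are the cases I cleaned up by hand
    return float(nums[0]), float(nums[1])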

"Close-to-home" bias

The full aggregated results:

Evaluator: gpt-4 (original evaluator used by the vicuna team)

{
  "alpaca-13b": {
    "total_score1": 576.0,
    "total_score2": 696.5,
    "failed_eval_count": 0,
    "better1": 4,
    "better2": 76,
    "tie": 0
  },
  "gpt35": {
    "total_score1": 684.0,
    "total_score2": 633.5,
    "failed_eval_count": 0,
    "better1": 42,
    "better2": 22,
    "tie": 16
  },
  "llama": {
    "total_score1": 523.0,
    "total_score2": 695.0,
    "failed_eval_count": 0,
    "better1": 5,
    "better2": 75,
    "tie": 0
  },
  "bard": {
    "total_score1": 661.5,
    "total_score2": 662.0,
    "failed_eval_count": 0,
    "better1": 28,
    "better2": 39,
    "tie": 13
  }
}

Evaluator: gpt-3.5-turbo

{
  "alpaca-13b": {
    "total_score1": 583.0,
    "total_score2": 700.0,
    "failed_eval_count": 0,
    "better1": 3,
    "better2": 77,
    "tie": 0
  },
  "gpt35": {
    "total_score1": 627.0,
    "total_score2": 665.0,
    "failed_eval_count": 0,
    "better1": 14,
    "better2": 65,
    "tie": 1
  },
  "llama-13b": {
    "total_score1": 588.0,
    "total_score2": 708.0,
    "failed_eval_count": 0,
    "better1": 0,
    "better2": 79,
    "tie": 1
  },
  "bard": {
    "total_score1": 609.0,
    "total_score2": 669.0,
    "failed_eval_count": 1,
    "better1": 8,
    "better2": 71,
    "tie": 0
  }
}

A few observations:

- Both evaluators strongly prefer vicuna over alpaca-13b and llama, by similar margins.
- With gpt-4 as the evaluator, gpt-3.5 is the only comparison model rated above vicuna (better1 = 42 vs better2 = 22), and bard is roughly even.
- With gpt-3.5-turbo as the evaluator, that flips: vicuna is preferred over gpt-3.5's own answers in 65 of 80 comparisons, and the margin over bard widens as well.

It is this last point that I call the "close-to-home" bias. I don't know why, but while you might expect gpt-3.5-turbo to rate its own responses higher... it does the opposite.

"Ordering" bias

Running vicuna-13b against itself with gpt-3.5-turbo as the evaluator yielded a surprising result: the second model (whose responses are identical to the first's) scored significantly higher.

   "vicuna-13b-20230322-new-hp-fp16": {
    "total_score1": 606.0,
    "total_score2": 620.0,
    "failed_eval_count": 1,
    "better1": 1,
    "better2": 11,
    "tie": 67
  },

While it's tricky to compare results across different pairwise comparisons, this 14-point bias is on the order of the differences between the Bard and gpt-3.5 scores. Since vicuna was always set as the second model in the original evaluation, the results (as presented in the blog post) may have a systematic upward bias.
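
For anyone who wants to reproduce the check, here is a minimal sketch of the self-comparison, reusing the hypothetical judge() helper from the sketch above. With an order-insensitive judge, identical answers should (almost) always tie.

# Ordering-bias check sketch: score the same answer in both positions and
# see whether the judge's verdict stays symmetric. Assumes the hypothetical
# judge() helper defined earlier; questions and answers are parallel lists
# of prompts and the model's outputs.
def order_bias_check(questions, answers):
    better1 = better2 = tie = failed = 0
    for q, a in zip(questions, answers):
        scores = judge(q, a, a)  # identical answers in both slots
        if scores is None:
            failed += 1
            continue
        s1, s2 = scores
        if s1 > s2:
            better1 += 1
        elif s2 > s1:
            better2 += 1
        else:
            tie += 1
    # An unbiased judge should return (almost) all ties here.
    return {"better1": better1, "better2": better2, "tie": tie,
            "failed_eval_count": failed}

Scoring each real pairing in both orders and averaging the two runs would be one straightforward way to cancel the positional effect.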

merrymercy commented 1 year ago

Thanks for sharing the findings. We also noticed some of them and are working on addressing these biases.