As described in the main post, the evaluation method is presented as a cool but informal idea and not a rigorous approach. The method is used to create some pretty compelling plots showing the performance of vicuna relative to other models.
I wanted to present some simple evidence that I've found of bias in the evaluation method. Please don't take this as a criticism -- more just some interesting observations!
The first is an "ordering" bias, and the second is a "close-to-home" bias. Both are based on very little data, so they may themselves be subject to criticism.
Method
First, I wrapped the evaluation code in the repository to aggregate all the scores into the following form (see aggregate_scores_from_table.py):
{
    "COMPARISON MODEL": {
        "total_score1": 576.0,       <-- for comparison model
        "total_score2": 696.5,       <-- for base model (vicuna)
        "failed_eval_count": 0,      <-- didn't get a score
        "better1": 4,                <-- comparison model had higher score
        "better2": 76,               <-- base model had higher score
        "tie": 0                     <-- tied scores
    },
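For concreteness, here is a minimal sketch of what that aggregation could look like. The per-question record format ("model1", "model2", "score1", "score2") and the reviews.json input file are assumptions for illustration, not the actual table schema used by aggregate_scores_from_table.py.

# Hypothetical sketch of the aggregation described above. The input format (a
# list of per-question records with "model1"/"model2" names and parsed
# "score1"/"score2" values) is an assumption, not the actual schema used by
# aggregate_scores_from_table.py.
import json
from collections import defaultdict

def aggregate(records):
    results = defaultdict(lambda: {
        "total_score1": 0.0,      # running total for the comparison model
        "total_score2": 0.0,      # running total for the base model (vicuna)
        "failed_eval_count": 0,
        "better1": 0,
        "better2": 0,
        "tie": 0,
    })
    for rec in records:
        agg = results[rec["model1"]]          # keyed by the comparison model
        s1, s2 = rec.get("score1"), rec.get("score2")
        if s1 is None or s2 is None:          # evaluator output had no usable score
            agg["failed_eval_count"] += 1
            continue
        agg["total_score1"] += s1
        agg["total_score2"] += s2
        if s1 > s2:
            agg["better1"] += 1
        elif s2 > s1:
            agg["better2"] += 1
        else:
            agg["tie"] += 1
    return dict(results)

if __name__ == "__main__":
    # Hypothetical input file of per-question review records.
    with open("reviews.json") as f:
        print(json.dumps(aggregate(json.load(f)), indent=2))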
Second, I used gpt-3.5-turbo to re-run the evaluations already present in the repository, and I ran one additional evaluation: vicuna-13b against itself (i.e., the same answers for model1 and model2). Note that in my re-runs, I went through and manually cleaned up and assigned the scores when they were present somewhere in the evaluator's text.
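As an illustration of that cleanup step, a fallback parser along these lines can recover scores buried in the evaluator's prose; the regexes and the "n/10" convention here are assumptions, not the repository's actual parsing logic.

# Hypothetical fallback parser illustrating the kind of manual score cleanup
# described above; this is not the repository's original parsing code. The
# evaluator is asked for two numeric scores, but sometimes they only appear
# embedded in the review text (e.g. "8/10 ... 9/10").
import re

def parse_score_pair(review_text):
    lines = review_text.strip().splitlines()
    if lines:
        # Preferred format: two numbers on the first line, e.g. "8 9" or "8.5 9".
        m = re.match(r"^\s*(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s*$", lines[0])
        if m:
            return float(m.group(1)), float(m.group(2))
    # Fallback: pick up the first two "n/10"-style scores anywhere in the text.
    scores = re.findall(r"(\d+(?:\.\d+)?)\s*/\s*10", review_text)
    if len(scores) >= 2:
        return float(scores[0]), float(scores[1])
    return None, None  # counted as a failed eval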
"Close-to-home" bias
The full aggregated results:
Evaluator: gpt-4 (original evaluator used by the vicuna team)

Evaluator: gpt-3.5-turbo

A few observations:
- The score for vicuna changes by ~10% depending on its pairing!? (true for both the gpt-4 and gpt-3.5-turbo evaluations)
- Generally, the vicuna score seems higher when it's paired against a weaker model (695 vs llama) and lower when paired with stronger models (633.5 vs gpt-3.5).
- Other than gpt-3.5's own score, evaluator gpt-3.5-turbo is actually pretty similar to evaluator gpt-4.
- GPT-3.5 seems to be hard on itself -- gpt-4 gives it a higher score (648) than it gives itself (627)!
It is this last point that I call the "close-to-home" bias. I don't know why, but while you might expect gpt-3.5-turbo to bias its own responses higher... it does the opposite.
"Ordering" bias
Running vicuna-13b against itself using evaluator gpt-3.5-turbo yielded a surprising result: the second model (whose responses were identical to the first's) scored significantly higher.
While it's tricky to compare results across different pairwise comparisons, the 14-point bias is on the order of the difference between the Bard and gpt-3.5 scores. Since vicuna was always set as the second model in the original evaluation, the results (as presented in the blog post) may have a systematic upward bias.
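One way to quantify this ordering effect would be to score each pair twice with the answer order swapped and compare the totals a model receives across the two orderings. In the sketch below, build_prompt and query_evaluator are hypothetical placeholders for the repo's prompt construction and evaluator API call; only the swapping bookkeeping is the point.

# Sketch of an order-swap check for positional bias. build_prompt() and
# query_evaluator() are hypothetical placeholders for the prompt construction
# and the OpenAI API call used by the evaluation scripts.
def ordering_bias_check(questions, answers_a, answers_b, build_prompt, query_evaluator):
    totals = {"a_in_slot1": 0.0, "a_in_slot2": 0.0}
    for q, a, b in zip(questions, answers_a, answers_b):
        # Pass 1: answer A occupies slot 1 of the prompt.
        s1, s2 = query_evaluator(build_prompt(q, a, b))
        totals["a_in_slot1"] += s1
        # Pass 2: same pair, slots swapped, so A's score comes back as s2.
        s1, s2 = query_evaluator(build_prompt(q, b, a))
        totals["a_in_slot2"] += s2
    # With no positional bias, A's two totals should come out roughly equal.
    return totals

If the evaluator were order-insensitive, the two totals for answer A would match; the 14-point gap in the vicuna-vs-vicuna run suggests they don't.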