As described in the main post, the evaluation method is presented as a cool but informal idea and not a rigorous approach. The method is used to create some pretty compelling plots showing the performance of vicuna relative to other models.
I wanted to present some simple evidence that I've found of bias in the evaluation method. Please don't take this as a criticism -- more just some interesting observations!
The first is an "ordering" bias, and the second is a "close-to-home" bias. Both are based on very little data, so they may themselves be subject to criticism.
Method
First, I wrapped the evaluation code in the repository to aggregate all the scores into the following form (see aggregate_scores_from_table.py):
{
    "COMPARISON MODEL": {
        "total_score1": 576.0,       <-- for comparison model
        "total_score2": 696.5,       <-- for base model (vicuna)
        "failed_eval_count": 0,      <-- didn't get a score
        "better1": 4,                <-- comparison model had higher score
        "better2": 76,               <-- base model had higher score
        "tie": 0                     <-- tied scores
    },
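For concreteness, here is a minimal sketch of what that aggregation could look like. The per-question record format ("model1", "model2", "score1", "score2") and the reviews.json input file are assumptions for illustration, not the actual table schema used by aggregate_scores_from_table.py.

# Hypothetical sketch of the aggregation described above. The input format (a
# list of per-question records with "model1"/"model2" names and parsed
# "score1"/"score2" values) is an assumption, not the actual schema used by
# aggregate_scores_from_table.py.
import json
from collections import defaultdict

def aggregate(records):
    results = defaultdict(lambda: {
        "total_score1": 0.0,      # running total for the comparison model
        "total_score2": 0.0,      # running total for the base model (vicuna)
        "failed_eval_count": 0,
        "better1": 0,
        "better2": 0,
        "tie": 0,
    })
    for rec in records:
        agg = results[rec["model1"]]          # keyed by the comparison model
        s1, s2 = rec.get("score1"), rec.get("score2")
        if s1 is None or s2 is None:          # evaluator output had no usable score
            agg["failed_eval_count"] += 1
            continue
        agg["total_score1"] += s1
        agg["total_score2"] += s2
        if s1 > s2:
            agg["better1"] += 1
        elif s2 > s1:
            agg["better2"] += 1
        else:
            agg["tie"] += 1
    return dict(results)

if __name__ == "__main__":
    # Hypothetical input file of per-question review records.
    with open("reviews.json") as f:
        print(json.dumps(aggregate(json.load(f)), indent=2))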
Second, I used gpt-3.5-turbo to re-run the evaluations already present in the repository, and I ran one additional evaluation: vicuna-13b against itself (i.e., the same answers for model1 and model2). Note that in my re-runs, I went through and manually cleaned up and assigned the scores when they were present somewhere in the evaluator's text.
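As an illustration of that cleanup step, a fallback parser along these lines can recover scores buried in the evaluator's prose; the regexes and the "n/10" convention here are assumptions, not the repository's actual parsing logic.

# Hypothetical fallback parser illustrating the kind of manual score cleanup
# described above; this is not the repository's original parsing code. The
# evaluator is asked for two numeric scores, but sometimes they only appear
# embedded in the review text (e.g. "8/10 ... 9/10").
import re

def parse_score_pair(review_text):
    lines = review_text.strip().splitlines()
    if lines:
        # Preferred format: two numbers on the first line, e.g. "8 9" or "8.5 9".
        m = re.match(r"^\s*(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s*$", lines[0])
        if m:
            return float(m.group(1)), float(m.group(2))
    # Fallback: pick up the first two "n/10"-style scores anywhere in the text.
    scores = re.findall(r"(\d+(?:\.\d+)?)\s*/\s*10", review_text)
    if len(scores) >= 2:
        return float(scores[0]), float(scores[1])
    return None, None  # counted as a failed eval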
"Close-to-home" bias
The full aggregated results:
Evaluator: gpt-4 (original evaluator used by the vicuna team)

Evaluator: gpt-3.5-turbo

A few observations:
- The score for vicuna changes by ~10% depending on its pairing!? (true for both the gpt-4 and gpt-3.5-turbo evaluations)
- Generally, the vicuna score seems higher when it's paired against a weaker model (695 vs llama) and lower when paired with stronger models (633.5 vs gpt-3.5).
- Other than gpt-3.5's own score, evaluator gpt-3.5-turbo is actually pretty similar to evaluator gpt-4.
- GPT-3.5 seems to be hard on itself -- gpt-4 gives it a higher score (648) than it gives itself (627)!
It is this last point that I call the "close-to-home" bias. I don't know why, but while you might expect gpt-3.5-turbo to bias its own responses higher... it does the opposite.
"Ordering" bias
Running vicuna-13b against itself using evaluator gpt-3.5-turbo yielded a surprising result: the second model (whose responses were identical to the first's) scored significantly higher.
While it's tricky to compare results across different pairwise comparisons, the 14-point bias is on the order of the difference between the Bard and gpt-3.5 scores. Since vicuna was always set as the second model in the original evaluation, the results (as presented in the blog post) may have a systematic upward bias.
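One way to quantify this ordering effect would be to score each pair twice with the answer order swapped and compare the totals a model receives across the two orderings. In the sketch below, build_prompt and query_evaluator are hypothetical placeholders for the repo's prompt construction and evaluator API call; only the swapping bookkeeping is the point.

# Sketch of an order-swap check for positional bias. build_prompt() and
# query_evaluator() are hypothetical placeholders for the prompt construction
# and the OpenAI API call used by the evaluation scripts.
def ordering_bias_check(questions, answers_a, answers_b, build_prompt, query_evaluator):
    totals = {"a_in_slot1": 0.0, "a_in_slot2": 0.0}
    for q, a, b in zip(questions, answers_a, answers_b):
        # Pass 1: answer A occupies slot 1 of the prompt.
        s1, s2 = query_evaluator(build_prompt(q, a, b))
        totals["a_in_slot1"] += s1
        # Pass 2: same pair, slots swapped, so A's score comes back as s2.
        s1, s2 = query_evaluator(build_prompt(q, b, a))
        totals["a_in_slot2"] += s2
    # With no positional bias, A's two totals should come out roughly equal.
    return totals

If the evaluator were order-insensitive, the two totals for answer A would match; the 14-point gap in the vicuna-vs-vicuna run suggests they don't.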