lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

MT-bench results are different today #1898

Open imoneoi opened 1 year ago

imoneoi commented 1 year ago

Today's MT-bench results are very different from yesterday's results (on the same answers). The GPT-4 API seems to have changed, since as of today all users can access the GPT-4 API (perhaps the model was quantized?).

infwinston commented 1 year ago

Hey @imoneoi, can you say how different the numbers are? Or is there any data you can provide so we can reproduce this?

We set the temperature to 0, so the results should be stable. I don't think there has been any API change to GPT-4 after the one (gpt-4-0613) announced weeks ago. As far as I know, OpenAI has been careful about keeping the API static. See this tweet: https://twitter.com/OfficialLoganK/status/1663934947931897857.
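For reference, a temperature-0 judge call with the pre-1.0 openai Python client looks roughly like the sketch below; the `judge_once` helper and the flat single-message prompt are illustrative assumptions, not FastChat's actual judging code.

```python
import openai  # pre-1.0 openai client interface


def judge_once(judge_prompt: str, model: str = "gpt-4-0613") -> str:
    """Send one judging prompt with temperature 0 so repeated runs on the
    same prompt should produce (near-)identical judgments."""
    response = openai.ChatCompletion.create(
        model=model,                                        # pin an explicit model snapshot
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,                                      # greedy-as-possible decoding
        max_tokens=2048,
    )
    return response["choices"][0]["message"]["content"]
```

Even at temperature 0 the API is not guaranteed to be bit-for-bit deterministic, but scores for the same answers should not swing wildly.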

nightdessert commented 1 year ago

Hi, I have been trying to reproduce the agreement results between the GPT judges and humans, but the agreement I get is much lower than what is reported in the paper. Could this be caused by changes to ChatGPT/GPT-4 themselves?

infwinston commented 1 year ago

@nightdessert could you elaborate more?

nightdessert commented 1 year ago

> @nightdessert could you elaborate more?

Yes. I take every line in human_judgments.json and annotate it with ChatGPT and GPT-4. The whole process (prompt, OpenAI API) is implemented with FastChat. After annotation, I compute the agreement between GPT and humans using compute_agreement.py. The result (stage 1, w/o tie) is around 0.5 for ChatGPT and around 0.6 for GPT-4, which is much lower than the reported results.
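As a rough illustration of what the stage-1 "w/o tie" agreement number means, here is a simplified sketch; this is not the actual compute_agreement.py, and the JSONL layout and field names (question_id, model_a, model_b, winner) are assumptions.

```python
import json


def agreement_without_ties(human_path: str, gpt_path: str) -> float:
    """Fraction of pairwise comparisons where the GPT verdict matches the
    human verdict, skipping pairs that either side judged a tie."""
    def load(path):
        with open(path) as f:
            return {
                (r["question_id"], r["model_a"], r["model_b"]): r["winner"]
                for r in (json.loads(line) for line in f)
            }

    human, gpt = load(human_path), load(gpt_path)
    matched = total = 0
    for key, h_winner in human.items():
        g_winner = gpt.get(key)
        if g_winner is None or "tie" in (h_winner, g_winner):
            continue  # skip missing pairs and ties (the "w/o tie" setting)
        total += 1
        matched += int(h_winner == g_winner)
    return matched / total if total else 0.0
```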

nightdessert commented 1 year ago

> @nightdessert could you elaborate more?

Here are some specific results for different types of instructions:

[image: agreement results broken down by instruction category]

The "GPT-4 original" numbers are calculated from the provided gpt4_pair_judgments.json and match the results in the paper. Because I only use the default pair-wise prompt (without a reference answer), I believe the comparison for the tasks not shown in gray is fair.

merrymercy commented 1 year ago

Our released GPT judgments were generated with gpt-4-0314. Could you try that version?

Are you using this prompt with position swap?

Could you do some manual inspection to see if the GPT-4 judgments make sense?
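For context, position swap means every pair is judged twice with the answer order reversed, and a win is only recorded when both orderings agree; below is a minimal sketch, where the `judge` callable and its "first"/"second"/"tie" return values are assumptions rather than FastChat's actual interface.

```python
def judge_with_position_swap(judge, answer_a: str, answer_b: str) -> str:
    """Judge the pair in both orders and only keep a verdict that is
    consistent across orderings; otherwise fall back to a tie."""
    verdict_ab = judge(answer_a, answer_b)  # answer A shown first
    verdict_ba = judge(answer_b, answer_a)  # answer B shown first

    if verdict_ab == "first" and verdict_ba == "second":
        return "A"    # A preferred in both orderings
    if verdict_ab == "second" and verdict_ba == "first":
        return "B"    # B preferred in both orderings
    return "tie"      # inconsistent verdicts suggest position bias
```

Skipping the swap, or aggregating it differently from the released judgments, can noticeably shift the agreement numbers.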