imoneoi opened this issue 1 year ago
Hey @imoneoi, can you say how different the numbers are? Or is there any data you can provide to reproduce this?
We set the temperature to 0, so the results should be stable. I don't think there has been any API change to GPT-4 since the one (gpt-4-0613) announced weeks ago.
As far as I know, OpenAI has been careful about keeping the API static. See this tweet: https://twitter.com/OfficialLoganK/status/1663934947931897857.
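For reference, a minimal sketch of what a pinned, temperature-0 judge call looks like with the pre-1.0 `openai` Python package (the interface FastChat used at the time); the model snapshot and token limit here are illustrative, not FastChat's exact code:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

def judge(prompt: str) -> str:
    """Single judge call pinned to an explicit model snapshot."""
    resp = openai.ChatCompletion.create(
        model="gpt-4-0613",  # pin a dated snapshot rather than bare "gpt-4"
        messages=[{"role": "user", "content": prompt}],
        temperature=0,       # greedy decoding, so reruns should be stable
        max_tokens=512,
    )
    return resp["choices"][0]["message"]["content"]
```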
Hi, I have been trying to reproduce the agreement results between GPT judges and humans, but it turns out that the agreement is substantially lower than what is reported in the paper. Could this be caused by changes to ChatGPT/GPT-4 themselves?
@nightdessert could you elaborate more?
Yes. I take every line in human_judgments.json and use ChatGPT and GPT-4 to annotate it. The whole process (prompt, OpenAI API) is implemented with FastChat. After annotation, I compute the agreement between GPT and humans with compute_agreement.py. The result (stage 1, w/o tie) for ChatGPT is around 0.5 and for GPT-4 around 0.6, which is substantially lower than the reported results.
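(For context: the agreement in question is just the fraction of pairs where the GPT verdict matches the human verdict, optionally dropping ties. A rough sketch of that computation, assuming each line of human_judgments.json is a JSON record with human and GPT verdicts in {"A", "B", "tie"}; the field names here are hypothetical, and the real logic lives in compute_agreement.py:)

```python
import json

def agreement(path: str, include_tie: bool = False) -> float:
    """Fraction of examples where the GPT verdict matches the human verdict."""
    records = [json.loads(line) for line in open(path)]
    pairs = [
        (r["human_verdict"], r["gpt_verdict"])  # hypothetical field names
        for r in records
        if include_tie or "tie" not in (r["human_verdict"], r["gpt_verdict"])
    ]
    return sum(h == g for h, g in pairs) / len(pairs)

print(agreement("human_judgments.json"))  # e.g. ~0.5 (ChatGPT), ~0.6 (GPT-4) above
```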
Here are some specific results for different types of instructions:
GPT-4 original is calculated from the provided gpt4_pair_judgments.json, which gives the same result as in the paper. Because I only use the default pair-wise prompt (w/o the reference answer), I believe the comparison on the tasks not marked in gray is fair.
Our released GPT-4 judgments were generated with gpt-4-0314. Could you try this version?
Are you using this prompt with position swap?
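(Position swap here means judging each pair twice with the answer order reversed and keeping a verdict only when both orders agree, which mitigates the judge's position bias. A rough sketch, where `ask_judge` is a hypothetical helper returning "A", "B", or "tie":)

```python
def judge_with_swap(ask_judge, question: str, ans_a: str, ans_b: str) -> str:
    """Judge a pair in both orders; inconsistent verdicts count as a tie."""
    first = ask_judge(question, ans_a, ans_b)   # model A shown first
    second = ask_judge(question, ans_b, ans_a)  # model B shown first
    second = {"A": "B", "B": "A", "tie": "tie"}[second]  # map back to original labels
    return first if first == second else "tie"
```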
Could you do some manual inspection to see if the GPT-4 judgments make sense?
Today's MT-bench results are very different from yesterday's (same answers). The GPT-4 API seems to have changed, since as of today all users can access the GPT-4 API (perhaps it was quantized?).