princeton-nlp / LLMBar

[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
https://arxiv.org/abs/2310.07641
MIT License

Showing model degradation with LLMBar? #4

Closed DavidFarago closed 3 weeks ago

DavidFarago commented 1 month ago

Tables 5 and 6 in the paper show a model degradation of ChatGPT from the older 0301 snapshot to the newer 0613 snapshot. Unfortunately, I did not see this degradation discussed in the paper itself.

It would be awesome to understand why that is the case, and why this benchmark seems to be the only one revealing that the urban legend of GPT models degrading is actually true.

This is related to https://github.com/princeton-nlp/LLMBar/issues/3, since that issue/feature request would also help with checking for model degradation; a rough sketch of what I mean is below.
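
For what it's worth, here is a minimal sketch of the kind of comparison I have in mind: run two ChatGPT snapshots as LLMBar evaluators and compare their accuracy against the gold labels. This is not the repo's own evaluation code; the dataset path, the field names (`input`, `output_1`, `output_2`, `label`), and the prompt wording are my assumptions and may need adjusting, and older snapshots such as `gpt-3.5-turbo-0301` may no longer be served by the OpenAI API.

```python
# Sketch only: compare two ChatGPT snapshots as evaluators on one LLMBar subset.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "You are comparing two responses to an instruction.\n"
    "Instruction: {input}\n\n"
    "Response 1: {output_1}\n\n"
    "Response 2: {output_2}\n\n"
    "Which response follows the instruction better? Answer '1' or '2' only."
)

def accuracy(model, instances):
    """Fraction of instances where `model` prefers the gold-labeled response."""
    correct = 0
    for ex in instances:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(**ex)}],
            temperature=0,
        )
        answer = resp.choices[0].message.content.strip()
        correct += int(bool(answer) and answer[0] == str(ex["label"]))
    return correct / len(instances)

# Hypothetical path; the actual file layout in the repo may differ.
with open("Dataset/LLMBar/Adversarial/Neighbor/dataset.json") as f:
    data = json.load(f)

for model in ("gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613"):
    print(model, accuracy(model, data))
```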

Zhiyuan-Zeng commented 1 month ago

Hi @DavidFarago, thank you for your interest and great observation!

Yes, we agree that GPT-3.5-turbo-0301 seems to perform better than GPT-3.5-turbo-0613 on LLMBar. The reasons for this might be complicated, and the "urban legend" you mentioned could be just one of them. Please also note that the ADVERSARIAL set is constructed via adversarial filtering against ChatGPT-0613, which poses more challenges for evaluators based on ChatGPT-0613 (we mentioned this point in the caption of Figure 4).
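
As a rough, toy illustration of that last point (this is not our actual construction pipeline, and the data below is made up): adversarial filtering keeps exactly the candidate instances that the filtering model misjudged, so that model's accuracy on the kept set is low by construction, while an unrelated model is not affected in the same way.

```python
# Toy illustration only: instances that the filtering model (conceptually
# ChatGPT-0613) judged correctly are discarded, so its accuracy on the
# remaining "adversarial" set is 0% by construction.
candidates = [
    {"id": 1, "filter_model_correct": True},
    {"id": 2, "filter_model_correct": False},
    {"id": 3, "filter_model_correct": False},
    {"id": 4, "filter_model_correct": True},
]

# Keep only the instances that fooled the filtering model.
adversarial_set = [ex for ex in candidates if not ex["filter_model_correct"]]

acc = sum(ex["filter_model_correct"] for ex in adversarial_set) / len(adversarial_set)
print(f"Filtering model's accuracy on the kept set: {acc:.0%}")  # prints 0%
```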

Additionally, I don't agree that LLMBar is the only benchmark relevant to what you discussed. I would like to share this paper, How is ChatGPT's behavior changing over time?, which might be of interest to you.