princeton-nlp / LLMBar

[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
https://arxiv.org/abs/2310.07641
MIT License

Showing model degradation with LLMBar? #4

Closed DavidFarago closed 3 weeks ago

DavidFarago commented 1 month ago

Tables 5 and 6 in the paper show a model degradation of ChatGPT from the older 0301 snapshot to the newer 0613 snapshot. Unfortunately, I did not see this degradation discussed in the paper itself.

It would be awesome to understand why that is the case, and why this benchmark seems to be the only one revealing that the urban legend of GPT models degrading is actually true.

This is related to https://github.com/princeton-nlp/LLMBar/issues/3, since that issue/feature request would also help with checking for model degradation; a rough sketch of what I mean is below.
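
For what it's worth, here is a minimal sketch of the kind of comparison I have in mind: run two ChatGPT snapshots as LLMBar evaluators and compare their accuracy against the gold labels. This is not the repo's own evaluation code; the dataset path, the field names (`input`, `output_1`, `output_2`, `label`), and the prompt wording are my assumptions and may need adjusting, and older snapshots such as `gpt-3.5-turbo-0301` may no longer be served by the OpenAI API.

```python
# Sketch only: compare two ChatGPT snapshots as evaluators on one LLMBar subset.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "You are comparing two responses to an instruction.\n"
    "Instruction: {input}\n\n"
    "Response 1: {output_1}\n\n"
    "Response 2: {output_2}\n\n"
    "Which response follows the instruction better? Answer '1' or '2' only."
)

def accuracy(model, instances):
    """Fraction of instances where `model` prefers the gold-labeled response."""
    correct = 0
    for ex in instances:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(**ex)}],
            temperature=0,
        )
        answer = resp.choices[0].message.content.strip()
        correct += int(bool(answer) and answer[0] == str(ex["label"]))
    return correct / len(instances)

# Hypothetical path; the actual file layout in the repo may differ.
with open("Dataset/LLMBar/Adversarial/Neighbor/dataset.json") as f:
    data = json.load(f)

for model in ("gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613"):
    print(model, accuracy(model, data))
```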

Zhiyuan-Zeng commented 1 month ago

Hi @DavidFarago, thank you for your interest and great observation!

Yes, we agree that GPT-3.5-turbo-0301 seems to perform better than GPT-3.5-turbo-0613 on LLMBar. The reasons for this might be complicated, and the "urban legend" you mentioned could be just one of them. Please also note that the ADVERSARIAL set is constructed via adversarial filtering against ChatGPT-0613, which poses more challenges for evaluators based on ChatGPT-0613 (we mentioned this point in the caption of Figure 4).
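
As a rough, toy illustration of that last point (this is not our actual construction pipeline, and the data below is made up): adversarial filtering keeps exactly the candidate instances that the filtering model misjudged, so that model's accuracy on the kept set is low by construction, while an unrelated model is not affected in the same way.

```python
# Toy illustration only: instances that the filtering model (conceptually
# ChatGPT-0613) judged correctly are discarded, so its accuracy on the
# remaining "adversarial" set is 0% by construction.
candidates = [
    {"id": 1, "filter_model_correct": True},
    {"id": 2, "filter_model_correct": False},
    {"id": 3, "filter_model_correct": False},
    {"id": 4, "filter_model_correct": True},
]

# Keep only the instances that fooled the filtering model.
adversarial_set = [ex for ex in candidates if not ex["filter_model_correct"]]

acc = sum(ex["filter_model_correct"] for ex in adversarial_set) / len(adversarial_set)
print(f"Filtering model's accuracy on the kept set: {acc:.0%}")  # prints 0%
```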

Additionally, I don't agree that LLMBar is the only benchmark relevant to what you discussed. I would like to share this paper, How is ChatGPT's behavior changing over time?, which might be of interest to you.