yzlnew opened 8 months ago
You are right, LLMs are sensitive to the prompt.
@tonysy Is this considered a bug in OpenCompass, and will it be fixed in a future release? I've noticed several other datasets with prompts configured similarly, which could cause a performance downgrade.
I think it is not a bug; it's an issue with the LLM rather than with the evaluation. Actually, we may need to introduce several different prompts to improve the robustness of the evaluation.
@tonysy I agree with this view. However, I want to point out that OpenCompass can give different results compared to the original benchmark implementations, and the prompts vary across datasets, for example with or without trailing whitespace.
But we can partially fix this issue at the tokenization stage. As a result, a model whose tokenizer additionally handles trailing whitespace gets higher scores on the leaderboard, even though that does not reflect the model's true capability.
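For illustration, here is a minimal sketch of the mismatch at the tokenizer level, using GPT-2's tokenizer purely as an example (the exact token splits depend on each model's tokenizer):

```python
# Minimal sketch of the trailing-whitespace mismatch (GPT-2 tokenizer
# used purely for illustration; not OpenCompass code).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# With a trailing space, the space becomes a standalone 'Ġ' token, so the
# model must emit a bare 'A' next -- a pattern rarely seen in pre-training,
# where the space is normally merged into the following token.
print(tok.tokenize("The answer is "))   # ['The', 'Ġanswer', 'Ġis', 'Ġ']
print(tok.tokenize("The answer is A"))  # ['The', 'Ġanswer', 'Ġis', 'ĠA']

# A tokenization-stage fix: strip the trailing space from the prompt and
# let it merge into the model's first generated token instead.
prompt = "The answer is "
print(tok.tokenize(prompt.rstrip()))    # ['The', 'Ġanswer', 'Ġis']
```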
Right, we are working on prompt sensitivity and will provide multi-prompt results soon. Stay tuned.
@yzlnew It's a problem related to BPE dropout. Our paper discusses this problem: https://arxiv.org/pdf/2404.03608
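For readers who haven't seen it: BPE dropout randomly skips merge rules at encoding time, so the same string can be segmented in several ways during training, which makes a model more robust to unusual splits such as a dangling space token. A rough sketch with the Hugging Face `tokenizers` library (the toy corpus and vocabulary size here are made up for the example):

```python
# Rough sketch of BPE dropout with the `tokenizers` library; the corpus
# and vocabulary size are toy values for illustration only.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# dropout=0.1: each learned merge is skipped with probability 0.1 at
# encoding time, so repeated encodes of one string can differ.
tokenizer = Tokenizer(BPE(dropout=0.1, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["the answer is answered"] * 100, trainer)

for _ in range(3):
    print(tokenizer.encode("the answer is answered").tokens)
```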
@longxudou Thanks. It seems like a simple but effective fix during the tokenization stage.
Type
I have modified the code (config is not considered code), or I'm working on my own tasks/models/datasets.
Reproduces the problem - code/configuration sample
Evaluating my own model.
Reproduces the problem - error message
None
Other information
I'm evaluating on AGIEval and noticed a performance drop under the default config. Digging into the predictions, I found that the model generates unusual tokens, like multiple whitespaces or "\n".
https://github.com/open-compass/opencompass/blob/ba7cd58da3317bdec233d097153e2ab92c5f5dd5/configs/datasets/agieval/agieval_gen_64afd3.py#L72
The issue is gone when I remove the trailing whitespace. It seems like an OOD problem: the base model has to predict in a situation not seen during pre-training, which is also mentioned in this video. Going back to the original AGIEval repo, there are no trailing whitespaces.
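To make the OOD effect concrete, here is a rough sketch comparing the top next-token predictions with and without the trailing space (gpt2 is a stand-in base model here, not the model I evaluated):

```python
# Rough sketch: how a trailing space shifts next-token predictions
# off-distribution. gpt2 is a stand-in; results vary by model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

for prompt in ["The answer is", "The answer is "]:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    top = torch.topk(logits, 5).indices.tolist()
    # After the bare prompt the top candidates are ordinary words; after
    # the trailing space they tend toward whitespace/newline tokens.
    print(repr(prompt), [tok.decode([i]) for i in top])
```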