Closed. omihub777 closed this issue 2 months ago.
Currently, this benchmark depends on lm-evaluation-harness (Stability AI's jp-stable branch) and EleutherAI's lm-evaluation-harness v0.3.0. Both have the same issue: https://github.com/Stability-AI/lm-evaluation-harness/blob/jp-stable/lm_eval/evaluator.py#L307C9-L307C54 https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.3.0/lm_eval/evaluator.py#L265
What is strange is why lm-evaluation-harness, made by EleutherAI, does not employ it. Does even EleutherAI think the article is not correct? If there are no concrete answers and no consensus on this issue, I prefer to follow the most popular implementation.
@masanorihirano Thank you for getting back to me.
If there are no concrete answers and no consensus on this issue, I prefer to follow the most popular implementation.
In the implementation of lm-evaluation-harness from Stability AI, they seem to adopt acc_norm, which is normalized by the character length of each choice, as well as the unnormalized one (i.e., acc).
Also, in the LLaMA paper from Meta, they mention in Section 3 (Main Results):
We follow Gao et al. (2021) and use the likelihood normalized by the number of characters in the completion, except for certain datasets (OpenBookQA, BoolQ), for which we follow Brown et al. (2020), and select a completion based on the likelihood normalized by the likelihood of the completion given “Answer:” as context: P(completion|context)/P(completion|“Answer:”).
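The two selection rules quoted above can be sketched as follows. This is a toy illustration with made-up log-probabilities; `score_choices` and all numbers are hypothetical and not the harness's actual API:

```python
def score_choices(choice_logprobs, choice_texts, answer_ctx_logprobs):
    """Rank candidate completions under the two normalization schemes
    described in the LLaMA paper (a sketch, not the harness's code)."""
    # Gao et al. (2021): summed log-probs divided by character length.
    char_norm = [sum(lp) / len(text)
                 for lp, text in zip(choice_logprobs, choice_texts)]
    # Brown et al. (2020): log P(completion|context) - log P(completion|"Answer:"),
    # i.e. the log of the ratio P(completion|context) / P(completion|"Answer:").
    ans_norm = [sum(lp) - sum(ctx_lp)
                for lp, ctx_lp in zip(choice_logprobs, answer_ctx_logprobs)]
    return char_norm, ans_norm

# Made-up per-token log-probs for two choices, "Paris" vs "the city of Paris".
char_norm, ans_norm = score_choices(
    choice_logprobs=[[-1.0], [-0.5, -0.5, -0.5, -0.4]],
    choice_texts=["Paris", "the city of Paris"],
    answer_ctx_logprobs=[[-3.0], [-1.0, -1.0, -1.0, -1.0]],
)
```

Under both schemes the final score is comparable across choices of different lengths, which is exactly what the raw summed log-probability is not.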
So, wouldn't it be great to report both acc_norm and acc?
I'm new to this kind of thing and still struggling with what "the standard way" to evaluate is, so I'd love to hear what you think.
OK, I totally understand what you meant. I mistakenly thought you were suggesting that normalization should be implemented in evaluator.py.
The problem you're pointing out is in the implementation of process_results in each task. The base.py you mentioned is usually overridden by each task class.
For example, jaqket_v1.py supports acc_norm. However, jsquad.py does not support a normalized score such as f1_norm.
Moreover, the normalized score in jaqket_v1.py does not follow "Multiple Choice Normalization in LM Evaluation": the current implementation simply divides the score by the character length of the text, whereas it should use the length in tokens, which depends on each model's tokenizer. Currently, the task classes in lm_harness cannot access each model's tokenizer, so an accurate implementation of token-normalized scores is not easy.
Summarizing the above, implementing normalized scores using the same implementation as acc_norm is possible, and I'll try to do so in the next release. However, I think it is impossible to use acc_norm as the final benchmark score.
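As a rough sketch, the char-length variant of acc_norm discussed in this thread could look like the following. `pick_answer` is a hypothetical helper, not code from the repository; token-length normalization would additionally need the model's tokenizer, which the task classes currently cannot reach:

```python
def pick_answer(logprob_sums, choices, normalize=True):
    """Return the index of the chosen option (hypothetical sketch).

    With normalize=True this mimics char-based acc_norm: each summed
    log-probability is divided by the character length of its choice.
    Token-length normalization would require the model's tokenizer here,
    which lm_harness task classes currently cannot access.
    """
    if normalize:
        scores = [lp / len(c) for lp, c in zip(logprob_sums, choices)]
    else:  # unnormalized acc
        scores = list(logprob_sums)
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up numbers: the longer choice has a lower total log-prob, so the
# unnormalized score picks the short one, the normalized score the long one.
unnorm = pick_answer([-6.0, -9.0], ["yes", "definitely yes"], normalize=False)
norm = pick_answer([-6.0, -9.0], ["yes", "definitely yes"], normalize=True)
```

Char-based normalization has the practical advantage that it needs no model-specific information, which is likely part of why both harnesses use it.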
Summarizing the above, implementing normalized scores using the same implementation as acc_norm is possible, and I'll try to do so in the next release.
Thank you for taking into account my suggestion!
By the way, you might already know this, but EleutherAI also normalizes the score by the number of characters instead of tokens for acc_norm on multiple-choice tasks (a deliberate design decision on their part).
Considering this and the LLaMA paper, it appears that normalizing scores by the number of characters is one of the "standard" ways to evaluate autoregressive LMs on multiple-choice tasks.
Looking forward to the next release.
Thank you for your suggestion. I'll consider refactoring the code according to your valuable comments.
Length-normalized results are also available as of v0.2.0.
Thank you for this remarkable work! I'm genuinely excited about this project. While exploring it, I'm curious about how you determine the choices LLMs make among multiple options, as discussed in the article "Multiple Choice Normalization in LM Evaluation."
In the evaluator.py file, specifically at #L320, it appears that the scores in requests are not normalized by the length of the choice sentences. This approach seems to favor shorter choices, especially in the cma_basics task, which calculates the sum of log probabilities over the full sentence of each choice rather than over simpler symbols (e.g., "1" or "○"). Could you explain whether there's a specific reason or rationale for using unnormalized scores? Please correct me if my understanding is off the mark. Once again, I appreciate your dedication to this project!
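The bias described above is easy to see with made-up numbers: if every token carries the same per-token log-probability, the unnormalized sum always favors the shorter choice, while dividing by length removes that advantage:

```python
# Hypothetical per-token log-probs: both choices are "equally likely"
# per token, but one is 3 tokens long and the other 10.
short = [-2.0] * 3   # total: -6.0
long_ = [-2.0] * 10  # total: -20.0

assert sum(short) > sum(long_)            # unnormalized: shorter choice wins
assert sum(short) / 3 == sum(long_) / 10  # length-normalized: a tie
```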