pfnet-research / japanese-lm-fin-harness

Japanese Language Model Financial Evaluation Harness
MIT License

No normalization for sum of log probs? #3

Closed · omihub777 closed this issue 2 months ago

omihub777 commented 4 months ago

Thank you for this remarkable work! I'm genuinely excited about this project. While exploring it, I'm curious about how you determine the choices LLMs make among multiple options, as discussed in the article "Multiple Choice Normalization in LM Evaluation."

In the evaluator.py file, specifically at #L320, it appears that the scores in requests are not normalized for the length of the choice sentences. This approach seems to favor shorter choices, especially in the cma_basics task, which calculates the sum of log probabilities using the full sentences of each choice rather than simpler symbols (e.g., "1" or "○"). Could you explain if there's a specific reason or rationale for utilizing unnormalized scores?
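
To illustrate the concern with made-up numbers (nothing below is taken from this repository), here is a toy comparison of the unnormalized sum against a per-token average:

```python
# Hypothetical per-token log probabilities for two choices of one question
# (illustrative numbers only, not taken from this benchmark).
choices = {
    "short": [-2.1, -1.8],                                    # 2 tokens
    "longer correct": [-1.0, -0.9, -1.1, -0.8, -1.0, -0.9],   # 6 tokens
}

for name, logprobs in choices.items():
    total = sum(logprobs)             # unnormalized sum of log probs
    mean = total / len(logprobs)      # length-normalized score
    print(f"{name}: sum={total:.2f}, per-token={mean:.2f}")

# The unnormalized sum prefers "short" (-3.90 > -5.70) even though
# "longer correct" has the better per-token likelihood (-0.95 vs. -1.95).
```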

Please correct me if my understanding is off the mark. Once again, I appreciate your dedication to this project!

masanorihirano commented 4 months ago

Currently, this benchmark depends on lm-evaluation-harness (Stability-AI's jp-stable branch) and lm_harness v0.3.0.

Both have the same issue:
https://github.com/Stability-AI/lm-evaluation-harness/blob/jp-stable/lm_eval/evaluator.py#L307C9-L307C54
https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.3.0/lm_eval/evaluator.py#L265

What is strange is why lm_harness, made by EleutherAI, does not employ it. Does even EleutherAI think the article is not correct? If there is no concrete answer or consensus on this issue, I prefer to follow the popular implementation.

omihub777 commented 4 months ago

@masanorihirano Thank you for getting back to me.

If there is no concrete answer or consensus on this issue, I prefer to follow the popular implementation.

In the implementation of lm-evaluation-harness from StabilityAI, they seem to adopt acc_norm, which is normalized by the character length of each choice, in addition to the unnormalized score (i.e., acc).

Also, in the LLaMA paper from Meta, they mention in Section 3 (Main Results):

We follow Gao et al. (2021) and use the likelihood normalized by the number of characters in the completion, except for certain datasets (OpenBookQA, BoolQ), for which we follow Brown et al. (2020), and select a completion based on the likelihood normalized by the likelihood of the completion given “Answer:” as context: P(completion|context)/P(completion|“Answer:”).
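
To make those two selection rules concrete, here is a small sketch; the function names and signatures are mine, not taken from any of these codebases:

```python
from typing import List

def select_char_normalized(loglikelihoods: List[float], choice_texts: List[str]) -> int:
    """Pick the choice with the highest log-likelihood divided by its
    character length (the Gao et al. 2021 rule quoted above)."""
    scores = [ll / len(text) for ll, text in zip(loglikelihoods, choice_texts)]
    return max(range(len(scores)), key=scores.__getitem__)

def select_answer_normalized(cond_lls: List[float], uncond_lls: List[float]) -> int:
    """Pick the choice maximizing log P(completion | context) minus
    log P(completion | "Answer:") (the Brown et al. 2020 rule)."""
    scores = [c - u for c, u in zip(cond_lls, uncond_lls)]
    return max(range(len(scores)), key=scores.__getitem__)
```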

So, would it make sense to report both acc_norm and acc?

I'm new to this kind of thing and still figuring out what "the standard way" to evaluate is, so I'd love to hear what you think.

masanorihirano commented 4 months ago

OK, I now understand what you meant. I mistakenly thought you were suggesting that normalization should be implemented in evaluator.py. The problem you're pointing out lies in the implementation of process_results in each task. The base.py you mentioned is usually overridden by each task class. For example, jaqket_v1.py supports acc_norm, but jsquad.py does not support a normalized score such as f1_norm.

Moreover, the normalized score in jaqket_v1.py does not follow "Multiple Choice Normalization in LM Evaluation." The current implementation simply divides the score by the character length of the text, whereas it should use the number of tokens, which depends on each model's tokenizer. Currently, the task classes in lm_harness cannot access each model's tokenizer, so an accurate implementation of normalized scores is not easy.
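
A minimal sketch of that distinction, assuming a generic `tokenizer.encode` interface (hypothetical here, since the task classes cannot actually reach the tokenizer):

```python
# Sketch of the two denominators discussed above. The `tokenizer` argument is
# hypothetical: the task classes in lm_harness cannot access each model's
# tokenizer, which is why only character-length normalization is practical there.

def char_normalized(loglikelihood: float, choice_text: str) -> float:
    # What the existing acc_norm-style normalization does: divide by character count.
    return loglikelihood / len(choice_text)

def token_normalized(loglikelihood: float, choice_text: str, tokenizer) -> float:
    # What the article calls for: divide by token count, which differs per model
    # because each model's tokenizer segments text differently.
    return loglikelihood / len(tokenizer.encode(choice_text))
```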

To summarize, implementing normalized scores in the same way as acc_norm is possible, and I'll try to do so in the next release. However, I don't think acc_norm can be used as the final benchmark score.

omihub777 commented 4 months ago

To summarize, implementing normalized scores in the same way as acc_norm is possible, and I'll try to do so in the next release.

Thank you for taking my suggestion into account!

By the way, you might already know this, but EleutherAI also normalizes the score by the number of characters rather than tokens for acc_norm on multiple-choice tasks (a deliberate design decision on their part).

Considering this and the LLaMA paper, normalizing scores by the number of characters appears to be one of the "standard" ways to evaluate autoregressive LMs on multiple-choice tasks.
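
For reference, the character-count acc_norm in the Eleuther-style multiple-choice tasks has roughly this shape (a paraphrased sketch, not the library code verbatim):

```python
import numpy as np

def process_results_sketch(doc: dict, results: list) -> dict:
    # results: one summed log-likelihood per choice; doc["gold"]: index of the correct choice.
    gold = doc["gold"]
    acc = 1.0 if int(np.argmax(results)) == gold else 0.0
    # acc_norm divides each log-likelihood by the character length of its choice text.
    completion_len = np.array([float(len(c)) for c in doc["choices"]])
    acc_norm = 1.0 if int(np.argmax(np.array(results) / completion_len)) == gold else 0.0
    return {"acc": acc, "acc_norm": acc_norm}
```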

Looking forward to the next release.

masanorihirano commented 4 months ago

Thank you for your suggestion. I'll consider refactoring the code based on your valuable comments.

masanorihirano commented 2 months ago

Length-normalized results are also available in v0.2.0.