open-compass / MathBench

[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
https://open-compass.github.io/MathBench/
Apache License 2.0

Questions about the model performance #21

Open · Datoow opened this issue 2 weeks ago

Datoow commented 2 weeks ago

I ran Llama-3-8B-Instruct on MathBench; the summary is:

dataset                                        version  metric    mode  llama-3-8b-instruct-hf

######## MathBench Application Accuracy ########
mathbench-college-single_choice_cn             783703   acc_1     gen   38.00
mathbench-college-single_choice_en             b0fb1b   acc_1     gen   40.00
mathbench-high-single_choice_cn                783703   acc_1     gen   39.33
mathbench-high-single_choice_en                b0fb1b   acc_1     gen   38.67
mathbench-middle-single_choice_cn              783703   acc_1     gen   49.33
mathbench-middle-single_choice_en              b0fb1b   acc_1     gen   28.67
mathbench-primary-cloze_cn                     ea47a6   accuracy  gen   63.33
mathbench-primary-cloze_en                     bcc9c6   accuracy  gen   71.33
mathbench-arithmetic-cloze_en                  bcc9c6   accuracy  gen   52.67

######## MathBench Application CircularEval ########
mathbench-college-single_choice_cn             783703   perf_4    gen    9.33
mathbench-college-single_choice_en             b0fb1b   perf_4    gen   15.33
mathbench-high-single_choice_cn                783703   perf_4    gen    8.67
mathbench-high-single_choice_en                b0fb1b   perf_4    gen   12.67
mathbench-middle-single_choice_cn              783703   perf_4    gen   23.33
mathbench-middle-single_choice_en              b0fb1b   perf_4    gen    9.33

######## MathBench Knowledge CircularEval ########
mathbench-college_knowledge-single_choice_cn   783703   perf_4    gen   52.85
mathbench-college_knowledge-single_choice_en   b0fb1b   perf_4    gen   66.77
mathbench-high_knowledge-single_choice_cn      783703   perf_4    gen   35.11
mathbench-high_knowledge-single_choice_en      b0fb1b   perf_4    gen   58.72
mathbench-middle_knowledge-single_choice_cn    783703   perf_4    gen   41.62
mathbench-middle_knowledge-single_choice_en    b0fb1b   perf_4    gen   64.57
mathbench-primary_knowledge-single_choice_cn   783703   perf_4    gen   37.98
mathbench-primary_knowledge-single_choice_en   b0fb1b   perf_4    gen   67.89

######## MathBench Knowledge Accuracy ########
mathbench-college_knowledge-single_choice_cn   783703   acc_1     gen   71.52
mathbench-college_knowledge-single_choice_en   b0fb1b   acc_1     gen   77.22
mathbench-high_knowledge-single_choice_cn      783703   acc_1     gen   60.00
mathbench-high_knowledge-single_choice_en      b0fb1b   acc_1     gen   75.09
mathbench-middle_knowledge-single_choice_cn    783703   acc_1     gen   64.97
mathbench-middle_knowledge-single_choice_en    b0fb1b   acc_1     gen   81.71
mathbench-primary_knowledge-single_choice_cn   783703   acc_1     gen   64.42
mathbench-primary_knowledge-single_choice_en   b0fb1b   acc_1     gen   82.57
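For context on the two metrics in this summary: acc_1 is plain single-pass accuracy, while perf_4 is the CircularEval score, which, as described in the MathBench/OpenCompass docs, counts a single-choice question as correct only when the model answers it correctly under all four circular rotations of the options. A minimal scoring sketch with made-up per-question results:

```python
# Sketch of CircularEval (perf_4) scoring for 4-option single-choice questions.
# The per-rotation booleans below are illustrative, not taken from the run above.
rotation_correct = {
    "q1": [True, True, True, True],     # correct under every rotation -> counts for perf_4
    "q2": [True, True, False, True],    # correct on the first pass only -> counts for acc_1
    "q3": [False, False, False, False],
}

acc_1 = sum(flags[0] for flags in rotation_correct.values()) / len(rotation_correct)
perf_4 = sum(all(flags) for flags in rotation_correct.values()) / len(rotation_correct)

print(f"acc_1  = {acc_1:.2%}")   # 66.67%
print(f"perf_4 = {perf_4:.2%}")  # 33.33%
```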

I would like to ask how the model performance is calculated. For example, the Primary Application score you report for Llama-3-8B-Instruct is 71.0, but my summary shows 71.33 for primary-cloze_en and 63.33 for primary-cloze_cn, and the average of these two is smaller. Likewise, the Primary Theory score you report is 60.2, but my summary shows 67.89 for single_choice_en and 37.98 for single_choice_cn, and the average of these two is much smaller.

The same discrepancy holds for the other Llama-3-8B-Instruct scores.

liushz commented 1 week ago

We have updated the summarizer, making it easier to obtain group results for MathBench A&T. Please refer to the updated README for guidance.
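For illustration, here is a minimal sketch of that grouping, under the assumption that each stage score is the plain mean of its cn and en subsets; that assumption is consistent with the reported numbers, e.g. (65.33 + 76.67) / 2 = 71.0 for Primary in MathBench-A and (44.23 + 76.15) / 2 ≈ 60.2 for Primary in MathBench-T, using the values from the tables below:

```python
# Sketch: stage-level MathBench scores as the mean of the cn/en subsets.
# The subset values are copied from the tables below; the averaging rule is an assumption.
from statistics import mean

mathbench_a_primary = mean([65.33, 76.67])  # primary cloze_cn / cloze_en (accuracy)
mathbench_t_primary = mean([44.23, 76.15])  # primary single_choice_cn / _en (perf_4)

print(f"MathBench-A Primary: {mathbench_a_primary:.1f}")  # 71.0
print(f"MathBench-T Primary: {mathbench_t_primary:.1f}")  # 60.2
```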

For Llama-3-8B-Instruct, here is what I obtained:

MathBench-A: llama-3-8b-instruct-hf


MathBench Accuracy

Category Metric Value
mathbench-new-college-single_choice_cn acc_1 35.33
mathbench-new-college-single_choice_en acc_1 46.67
mathbench-new-high-single_choice_cn acc_1 44.67
mathbench-new-high-single_choice_en acc_1 51.33
mathbench-new-middle-single_choice_cn acc_1 54.00
mathbench-new-middle-single_choice_en acc_1 44.67
mathbench-new-primary-cloze_cn accuracy 65.33
mathbench-new-primary-cloze_en accuracy 76.67
mathbench-new-calculate-cloze_en accuracy 54.67

MathBench CircularEval

Category Metric Value
mathbench-new-college-single_choice_cn perf_4 9.33
mathbench-new-college-single_choice_en perf_4 18.67
mathbench-new-high-single_choice_cn perf_4 10.67
mathbench-new-high-single_choice_en perf_4 27.33
mathbench-new-middle-single_choice_cn perf_4 28.67
mathbench-new-middle-single_choice_en perf_4 21.33
mathbench-new-primary-cloze_cn accuracy 65.33
mathbench-new-primary-cloze_en accuracy 76.67
mathbench-new-calculate-cloze_en accuracy 54.67

And for MathBench-T:


MathBench-T: llama-3-8b-instruct-hf

MathBench CircularEval

Category Metric Value
mathbench-knowledge-college-single_choice_cn perf_4 43.35
mathbench-knowledge-college-single_choice_en perf_4 63.92
mathbench-knowledge-high-single_choice_cn perf_4 33.19
mathbench-knowledge-high-single_choice_en perf_4 53.74
mathbench-knowledge-middle-single_choice_cn perf_4 36.23
mathbench-knowledge-middle-single_choice_en perf_4 66.29
mathbench-knowledge-primary-single_choice_cn perf_4 44.23
mathbench-knowledge-primary-single_choice_en perf_4 76.15

MathBench Knowledge Accuracy

Category Metric Value
mathbench-college_knowledge-single_choice_cn acc_1 67.41
mathbench-college_knowledge-single_choice_en acc_1 78.80
mathbench-high_knowledge-single_choice_cn acc_1 62.55
mathbench-high_knowledge-single_choice_en acc_1 72.95
mathbench-middle_knowledge-single_choice_cn acc_1 63.47
mathbench-middle_knowledge-single_choice_en acc_1 80.57
mathbench-primary_knowledge-single_choice_cn acc_1 68.27
mathbench-primary_knowledge-single_choice_en acc_1 82.57

It appears that your results are lower than mine. Please ensure that you are using the latest dataset, released on May 14: https://github.com/open-compass/MathBench/releases/tag/v0.1.0

Datoow commented 1 week ago

Thank you for your reply. We are indeed using the latest dataset... Is there anything else that might be different? What prompt is used for the input? Could you provide the prediction files? P.S. Why are our category names different? Looking forward to your reply, thanks.

liushz commented 1 week ago

Can you give me some prediction samples, especially for single-choice questions?

liushz commented 1 week ago

If the predictions contain no CoT reasoning, please pull the newest main branch of OpenCompass and evaluate again: in the last version of OpenCompass, the CoT setting was unexpectedly set to None.
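A quick way to confirm this is to inspect a few raw predictions: if they are bare option letters or final numbers with no reasoning, CoT was likely dropped. A minimal check, assuming the usual OpenCompass prediction layout of a JSON file mapping indices to records with a "prediction" field (the path below is a placeholder; adjust it and the keys to your own outputs/ directory):

```python
# Rough check for missing chain-of-thought in OpenCompass prediction files.
# Both the path and the {"<idx>": {"prediction": ...}} layout are assumptions;
# adapt them to your own output directory and file format.
import json

path = "outputs/mathbench/predictions/llama-3-8b-instruct-hf/mathbench-college-single_choice_en.json"
with open(path) as f:
    preds = json.load(f)

# Very short predictions (e.g. a single option letter) suggest CoT was not applied.
short = [k for k, v in preds.items() if len(str(v.get("prediction", "")).strip()) < 20]
print(f"{len(short)}/{len(preds)} predictions look like bare answers (no CoT)")
```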