Datoow opened this issue 2 weeks ago
We have updated the summarizer, making it easier to obtain grouped results for MathBench A&T. Please refer to the updated README for guidance.
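For anyone landing here, a summarizer config for this has roughly the shape sketched below. The import path and group/abbreviation names are placeholders I made up for illustration; the updated README has the exact ones:

```python
# Sketch of an OpenCompass run config that reports grouped MathBench scores.
# The module path and variable names below are placeholders, not the real
# ones -- see the updated README for the exact config to import.
from mmengine.config import read_base

with read_base():
    from .summarizers.groups.mathbench import mathbench_summary_groups  # placeholder path

summarizer = dict(
    # grouped scores to report, in display order (placeholder names)
    dataset_abbrs=['mathbench-a (average)', 'mathbench-t (average)'],
    summary_groups=mathbench_summary_groups,
)
```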
For Llama-3-8B-Instruct, here is what I obtained:
**MathBench Accuracy**

| Category | Metric | Value |
|---|---|---|
| mathbench-new-college-single_choice_cn | acc_1 | 35.33 |
| mathbench-new-college-single_choice_en | acc_1 | 46.67 |
| mathbench-new-high-single_choice_cn | acc_1 | 44.67 |
| mathbench-new-high-single_choice_en | acc_1 | 51.33 |
| mathbench-new-middle-single_choice_cn | acc_1 | 54.00 |
| mathbench-new-middle-single_choice_en | acc_1 | 44.67 |
| mathbench-new-primary-cloze_cn | accuracy | 65.33 |
| mathbench-new-primary-cloze_en | accuracy | 76.67 |
| mathbench-new-calculate-cloze_en | accuracy | 54.67 |
**MathBench CircularEval**

| Category | Metric | Value |
|---|---|---|
| mathbench-new-college-single_choice_cn | perf_4 | 9.33 |
| mathbench-new-college-single_choice_en | perf_4 | 18.67 |
| mathbench-new-high-single_choice_cn | perf_4 | 10.67 |
| mathbench-new-high-single_choice_en | perf_4 | 27.33 |
| mathbench-new-middle-single_choice_cn | perf_4 | 28.67 |
| mathbench-new-middle-single_choice_en | perf_4 | 21.33 |
| mathbench-new-primary-cloze_cn | accuracy | 65.33 |
| mathbench-new-primary-cloze_en | accuracy | 76.67 |
| mathbench-new-calculate-cloze_en | accuracy | 54.67 |
And for MathBench-T:
**MathBench CircularEval**

| Category | Metric | Value |
|---|---|---|
| mathbench-knowledge-college-single_choice_cn | perf_4 | 43.35 |
| mathbench-knowledge-college-single_choice_en | perf_4 | 63.92 |
| mathbench-knowledge-high-single_choice_cn | perf_4 | 33.19 |
| mathbench-knowledge-high-single_choice_en | perf_4 | 53.74 |
| mathbench-knowledge-middle-single_choice_cn | perf_4 | 36.23 |
| mathbench-knowledge-middle-single_choice_en | perf_4 | 66.29 |
| mathbench-knowledge-primary-single_choice_cn | perf_4 | 44.23 |
| mathbench-knowledge-primary-single_choice_en | perf_4 | 76.15 |
**MathBench Knowledge Accuracy**

| Category | Metric | Value |
|---|---|---|
| mathbench-college_knowledge-single_choice_cn | acc_1 | 67.41 |
| mathbench-college_knowledge-single_choice_en | acc_1 | 78.80 |
| mathbench-high_knowledge-single_choice_cn | acc_1 | 62.55 |
| mathbench-high_knowledge-single_choice_en | acc_1 | 72.95 |
| mathbench-middle_knowledge-single_choice_cn | acc_1 | 63.47 |
| mathbench-middle_knowledge-single_choice_en | acc_1 | 80.57 |
| mathbench-primary_knowledge-single_choice_cn | acc_1 | 68.27 |
| mathbench-primary_knowledge-single_choice_en | acc_1 | 82.57 |
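A note on the two metrics above, since they explain why the CircularEval numbers are so much lower: as I understand the naming, acc_1 is plain one-pass accuracy on the original option ordering, while under CircularEval each single-choice question is re-asked with its options circularly shifted 4 ways, and perf_4 credits a question only if all 4 variants are answered correctly. A minimal sketch of that convention (the function names are mine, not OpenCompass code):

```python
def acc_1(results):
    """results: one list of 4 booleans per question,
    one boolean per circular shift of the options."""
    # score only the original option ordering
    return 100 * sum(r[0] for r in results) / len(results)

def perf_4(results):
    # a question counts only if all 4 shifted versions are correct
    return 100 * sum(all(r) for r in results) / len(results)

# toy example: 3 questions evaluated under 4 circular shifts each
results = [
    [True, True, True, True],     # robustly correct
    [True, False, True, True],    # correct originally, fails one shift
    [False, False, False, False], # always wrong
]
print(acc_1(results))   # ~66.67: 2 of 3 correct on the original ordering
print(perf_4(results))  # ~33.33: only 1 of 3 correct under every shift
```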
It appears that your results are lower than mine. Please ensure that you are using the latest dataset, released on May 14: https://github.com/open-compass/MathBench/releases/tag/v0.1.0
Thank you for your reply. We do use the latest dataset... Is there anything else that may be different? What prompt is used for the input? Could you provide the prediction files? P.S. Why are our category names different? Looking forward to your reply, with thanks.
Can you give me some prediction samples, especially for single-choice questions?
If there is no CoT in the predictions, please pull the newest main branch of OpenCompass and evaluate again: CoT was unexpectedly set to None in the last version of OpenCompass.
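A quick way to check is to open one of your prediction files and see whether the outputs contain reasoning text. The path and JSON keys below follow OpenCompass's usual output layout but are assumptions on my part, so adjust them to your run:

```python
import json

# Adjust to your run: OpenCompass typically writes predictions under
# outputs/<timestamp>/predictions/<model>/<dataset>.json (path assumed here).
path = "outputs/default/predictions/llama-3-8b-instruct-hf/mathbench-college-single_choice_en.json"

with open(path) as f:
    preds = json.load(f)

# Entries are keyed "0", "1", ...; print the first few to inspect them.
for key in list(preds)[:3]:
    item = preds[key]
    print("PROMPT:\n", item.get("origin_prompt"))
    print("PREDICTION:\n", item.get("prediction"))
    # Bare answers like "A" with no reasoning suggest CoT was disabled;
    # update to the latest main branch and re-evaluate.
```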
I tried Llama-3-8B-Instruct on MathBench; the summary is:
```
dataset                                         version  metric    mode  llama-3-8b-instruct-hf
----------------------------------------------  -------  --------  ----  ----------------------
######## MathBench Application Accuracy ########
mathbench-college-single_choice_cn              783703   acc_1     gen   38.00
mathbench-college-single_choice_en              b0fb1b   acc_1     gen   40.00
mathbench-high-single_choice_cn                 783703   acc_1     gen   39.33
mathbench-high-single_choice_en                 b0fb1b   acc_1     gen   38.67
mathbench-middle-single_choice_cn               783703   acc_1     gen   49.33
mathbench-middle-single_choice_en               b0fb1b   acc_1     gen   28.67
mathbench-primary-cloze_cn                      ea47a6   accuracy  gen   63.33
mathbench-primary-cloze_en                      bcc9c6   accuracy  gen   71.33
mathbench-arithmetic-cloze_en                   bcc9c6   accuracy  gen   52.67
######## MathBench Application CircularEval ########
mathbench-college-single_choice_cn              783703   perf_4    gen    9.33
mathbench-college-single_choice_en              b0fb1b   perf_4    gen   15.33
mathbench-high-single_choice_cn                 783703   perf_4    gen    8.67
mathbench-high-single_choice_en                 b0fb1b   perf_4    gen   12.67
mathbench-middle-single_choice_cn               783703   perf_4    gen   23.33
mathbench-middle-single_choice_en               b0fb1b   perf_4    gen    9.33
######## MathBench Knowledge CircularEval ########
mathbench-college_knowledge-single_choice_cn    783703   perf_4    gen   52.85
mathbench-college_knowledge-single_choice_en    b0fb1b   perf_4    gen   66.77
mathbench-high_knowledge-single_choice_cn       783703   perf_4    gen   35.11
mathbench-high_knowledge-single_choice_en       b0fb1b   perf_4    gen   58.72
mathbench-middle_knowledge-single_choice_cn     783703   perf_4    gen   41.62
mathbench-middle_knowledge-single_choice_en     b0fb1b   perf_4    gen   64.57
mathbench-primary_knowledge-single_choice_cn    783703   perf_4    gen   37.98
mathbench-primary_knowledge-single_choice_en    b0fb1b   perf_4    gen   67.89
######## MathBench Knowledge Accuracy ########
mathbench-college_knowledge-single_choice_cn    783703   acc_1     gen   71.52
mathbench-college_knowledge-single_choice_en    b0fb1b   acc_1     gen   77.22
mathbench-high_knowledge-single_choice_cn       783703   acc_1     gen   60.00
mathbench-high_knowledge-single_choice_en       b0fb1b   acc_1     gen   75.09
mathbench-middle_knowledge-single_choice_cn     783703   acc_1     gen   64.97
mathbench-middle_knowledge-single_choice_en     b0fb1b   acc_1     gen   81.71
mathbench-primary_knowledge-single_choice_cn    783703   acc_1     gen   64.42
mathbench-primary_knowledge-single_choice_en    b0fb1b   acc_1     gen   82.57
```
I would like to ask how the model performance is calculated. For example, the Primary Application score you report for Llama-3-8B-Instruct is 71.0, but my summary shows 71.33 for primary-cloze_en and 63.33 for primary-cloze_cn, and the average of those two is smaller. Likewise, the Primary Theory score you report is 60.2, but my summary shows 67.89 for single_choice_en and 37.98 for single_choice_cn, whose average is much smaller. The same discrepancy appears in the other Llama-3-8B-Instruct scores.
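For concreteness, here is the plain-average arithmetic behind my question, using the numbers from my summary above against the scores you reported:

```python
# Plain averages of the EN/CN splits from my summary above.
primary_application = (71.33 + 63.33) / 2  # = 67.33, vs the reported 71.0
primary_theory = (67.89 + 37.98) / 2       # = 52.935, vs the reported 60.2
print(primary_application, primary_theory)
```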