meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama 3 with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A, along with a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Includes demo apps showcasing Meta Llama 3 for WhatsApp & Messenger.

can't reproduce llama3.1 evaluation results #613

Closed: Juhywcy closed this issue 2 weeks ago

Juhywcy commented 1 month ago

System Info

[pip3] numpy==1.26.3
[pip3] torch==2.3.1+cu121
[pip3] torchaudio==2.3.1+cu121
[pip3] torchvision==0.18.1+cu121
[pip3] triton==2.3.1
[conda] numpy 1.26.3 pypi_0 pypi
[conda] torch 2.3.1+cu121 pypi_0 pypi
[conda] torchaudio 2.3.1+cu121 pypi_0 pypi
[conda] torchvision 0.18.1+cu121 pypi_0 pypi
[conda] triton 2.3.1 pypi_0 pypi

🐛 Describe the bug

I followed the README but got low scores. I downloaded llama-recipes/tools/benchmarks/llm_eval_harness/ and installed lm_evaluation_harness in that folder, then ran tools/benchmarks/llm_eval_harness/open_llm_eval_prep.sh and eval.py. How can I reproduce the results correctly?
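For context, the run described above amounts to pointing lm-evaluation-harness at the model with Open-LLM-Leaderboard-style few-shot settings. Below is a minimal, hypothetical sketch that calls the harness directly through its Python API rather than through the repo's open_llm_eval_prep.sh/eval.py wrappers; the model ID, task, and settings are assumptions, not the exact configuration those scripts generate.

```python
# Hypothetical sanity check using lm-evaluation-harness's Python API directly
# (not the repo's eval.py). Model ID, task, and few-shot count are assumptions
# and may differ from the configs that open_llm_eval_prep.sh writes out.
import lm_eval
from lm_eval.utils import make_table

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3.1-8B,dtype=bfloat16",
    tasks=["arc_challenge"],
    num_fewshot=25,   # leaderboard-style 25-shot ARC
    batch_size=8,
    limit=None,       # evaluate the full set; a small limit gives noisy scores
)
print(make_table(results))  # prints the same style of table as in the log below
```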

Error logs

2024-07-28:11:41:09,270 INFO [eval.py:85]

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| arc 25 shot | 1 | none | 25 | acc_norm | 0.3000 | ± 0.0461 |
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.0300 | ± 0.0171 |
| | | strict-match | 5 | exact_match | 0.0000 | ± 0.0000 |
| hellaswag 10 shot | 1 | none | 10 | acc_norm | 0.4200 | ± 0.0496 |
| mmlu | 1 | none | | acc | 0.2354 | ± 0.0056 |
| - humanities | 1 | none | | acc | 0.2462 | ± 0.0119 |
| - formal_logic | 0 | none | 0 | acc | 0.2300 | ± 0.0423 |
| - high_school_european_history | 0 | none | 0 | acc | 0.2800 | ± 0.0451 |
| - high_school_us_history | 0 | none | 0 | acc | 0.2400 | ± 0.0429 |
| - high_school_world_history | 0 | none | 0 | acc | 0.3000 | ± 0.0461 |
| - international_law | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - jurisprudence | 0 | none | 0 | acc | 0.2900 | ± 0.0456 |
| - logical_fallacies | 0 | none | 0 | acc | 0.1900 | ± 0.0394 |
| - moral_disputes | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - moral_scenarios | 0 | none | 0 | acc | 0.2200 | ± 0.0416 |
| - philosophy | 0 | none | 0 | acc | 0.1600 | ± 0.0368 |
| - prehistory | 0 | none | 0 | acc | 0.2300 | ± 0.0423 |
| - professional_law | 0 | none | 0 | acc | 0.2200 | ± 0.0416 |
| - world_religions | 0 | none | 0 | acc | 0.3400 | ± 0.0476 |
| - other | 1 | none | | acc | 0.2354 | ± 0.0117 |
| - business_ethics | 0 | none | 0 | acc | 0.3200 | ± 0.0469 |
| - clinical_knowledge | 0 | none | 0 | acc | 0.1400 | ± 0.0349 |
| - college_medicine | 0 | none | 0 | acc | 0.2100 | ± 0.0409 |
| - global_facts | 0 | none | 0 | acc | 0.2100 | ± 0.0409 |
| - human_aging | 0 | none | 0 | acc | 0.3100 | ± 0.0465 |
| - management | 0 | none | 0 | acc | 0.2000 | ± 0.0402 |
| - marketing | 0 | none | 0 | acc | 0.3400 | ± 0.0476 |
| - medical_genetics | 0 | none | 0 | acc | 0.2800 | ± 0.0451 |
| - miscellaneous | 0 | none | 0 | acc | 0.2100 | ± 0.0409 |
| - nutrition | 0 | none | 0 | acc | 0.2300 | ± 0.0423 |
| - professional_accounting | 0 | none | 0 | acc | 0.2100 | ± 0.0409 |
| - professional_medicine | 0 | none | 0 | acc | 0.1500 | ± 0.0359 |
| - virology | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - social sciences | 1 | none | | acc | 0.2258 | ± 0.0121 |
| - econometrics | 0 | none | 0 | acc | 0.2800 | ± 0.0451 |
| - high_school_geography | 0 | none | 0 | acc | 0.1600 | ± 0.0368 |
| - high_school_government_and_politics | 0 | none | 0 | acc | 0.1700 | ± 0.0378 |
| - high_school_macroeconomics | 0 | none | 0 | acc | 0.1600 | ± 0.0368 |
| - high_school_microeconomics | 0 | none | 0 | acc | 0.2200 | ± 0.0416 |
| - high_school_psychology | 0 | none | 0 | acc | 0.2200 | ± 0.0416 |
| - human_sexuality | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - professional_psychology | 0 | none | 0 | acc | 0.2300 | ± 0.0423 |
| - public_relations | 0 | none | 0 | acc | 0.2200 | ± 0.0416 |
| - security_studies | 0 | none | 0 | acc | 0.2300 | ± 0.0423 |
| - sociology | 0 | none | 0 | acc | 0.2900 | ± 0.0456 |
| - us_foreign_policy | 0 | none | 0 | acc | 0.2800 | ± 0.0451 |
| - stem | 1 | none | | acc | 0.2342 | ± 0.0097 |
| - abstract_algebra | 0 | none | 0 | acc | 0.2300 | ± 0.0423 |
| - anatomy | 0 | none | 0 | acc | 0.1900 | ± 0.0394 |
| - astronomy | 0 | none | 0 | acc | 0.2100 | ± 0.0409 |
| - college_biology | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - college_chemistry | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - college_computer_science | 0 | none | 0 | acc | 0.2600 | ± 0.0441 |
| - college_mathematics | 0 | none | 0 | acc | 0.2000 | ± 0.0402 |
| - college_physics | 0 | none | 0 | acc | 0.2300 | ± 0.0423 |
| - computer_security | 0 | none | 0 | acc | 0.3000 | ± 0.0461 |
| - conceptual_physics | 0 | none | 0 | acc | 0.3300 | ± 0.0473 |
| - electrical_engineering | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - elementary_mathematics | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - high_school_biology | 0 | none | 0 | acc | 0.1300 | ± 0.0338 |
| - high_school_chemistry | 0 | none | 0 | acc | 0.2000 | ± 0.0402 |
| - high_school_computer_science | 0 | none | 0 | acc | 0.2900 | ± 0.0456 |
| - high_school_mathematics | 0 | none | 0 | acc | 0.2200 | ± 0.0416 |
| - high_school_physics | 0 | none | 0 | acc | 0.2000 | ± 0.0402 |
| - high_school_statistics | 0 | none | 0 | acc | 0.1600 | ± 0.0368 |
| - machine_learning | 0 | none | 0 | acc | 0.3000 | ± 0.0461 |
| truthfulqa_mc2 | 2 | none | 0 | acc | 0.5092 | ± 0.0455 |
| winogrande 5 shot | 1 | none | 5 | acc | 0.5500 | ± 0.0500 |

Expected behavior

Results comparable to those reported in the Llama 3.1 report.

wukaixingxp commented 1 month ago

Hi! Thanks for your question! We used our internal eval implementation to generate those metrics instead of relying on the public lm_evaluation_harness library. Here is a summary of our eval details, and we also published the evaluation result details as datasets in the Llama 3.1 Evals Hugging Face collection for you to review. Meanwhile, we are working to see whether we can use code from lm_evaluation_harness to load our eval-detail datasets and reproduce our results. Please stay tuned.
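For anyone who just wants to inspect those published eval details, they can be pulled down like any other Hugging Face dataset. A rough sketch follows; the repo id, config name, and split below are assumptions based on the Llama 3.1 Evals collection, so check the collection page for the exact names (and accept the gating terms first).

```python
# Rough sketch: loading one of the published Llama 3.1 eval-detail datasets.
# The repo id, config name, and split are assumptions; verify them against the
# Llama 3.1 Evals collection on Hugging Face before running.
from datasets import load_dataset

ds = load_dataset(
    "meta-llama/Meta-Llama-3.1-8B-Instruct-evals",           # assumed repo id
    name="Meta-Llama-3.1-8B-Instruct-evals__mmlu__details",  # assumed config name
    split="latest",                                          # assumed split name
)
print(ds.column_names)  # per-example prompts, model outputs, and scores
print(ds[0])            # inspect a single evaluated example
```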

Juhywcy commented 1 month ago

Thanks for your reply! Is there any code to reproduce the evaluation results? I want to test my own method.

Juhywcy commented 1 month ago

I would very much like to reproduce your results, which are very important for our research. Thanks.

wukaixingxp commented 1 month ago

Hi! We published details for 42 tasks in this release. Are there particular tasks you want to reproduce for your research?

Juhywcy commented 1 month ago

Yes, I want to reproduce the tasks from the Llama 3.1 report, but so far I have only found the evaluation logs at the URL you gave me. Could you share some code to reproduce the results? Thank you very much.

init27 commented 2 weeks ago

@Juhywcy please let us know if the PR above / the latest code doesn't answer your questions. Thanks for raising the issue!

wukaixingxp commented 2 weeks ago

@Juhywcy We have developed an eval-reproduction recipe that runs our published Llama 3.1 Evals Hugging Face datasets with lm-evaluation-harness. Please take a look; hopefully it will be helpful to you.
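For reference, a directory of custom task definitions like the one such a recipe provides can be registered with lm-evaluation-harness through its TaskManager. The sketch below is hypothetical: the include path and task name are placeholders, not the recipe's actual files.

```python
# Hypothetical sketch of plugging a directory of custom task YAMLs (such as the
# ones a reproduce recipe ships) into lm-evaluation-harness. The include path
# and task name are placeholders, not the recipe's actual file or task names.
import lm_eval
from lm_eval.tasks import TaskManager
from lm_eval.utils import make_table

task_manager = TaskManager(include_path="./meta_eval_tasks")  # placeholder dir

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16",
    tasks=["meta_mmlu"],        # placeholder task name defined by the YAMLs above
    task_manager=task_manager,
    batch_size=8,
)
print(make_table(results))
```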