meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama 3 with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A, along with a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Includes demo apps showcasing Meta Llama 3 for WhatsApp & Messenger.

can't reproduce llama3.1 evaluation results #613

Closed: Juhywcy closed this issue 2 weeks ago

Juhywcy commented 1 month ago

System Info

[pip3] numpy==1.26.3
[pip3] torch==2.3.1+cu121
[pip3] torchaudio==2.3.1+cu121
[pip3] torchvision==0.18.1+cu121
[pip3] triton==2.3.1
[conda] numpy 1.26.3 pypi_0 pypi
[conda] torch 2.3.1+cu121 pypi_0 pypi
[conda] torchaudio 2.3.1+cu121 pypi_0 pypi
[conda] torchvision 0.18.1+cu121 pypi_0 pypi
[conda] triton 2.3.1 pypi_0 pypi

🐛 Describe the bug

I followed the README but got low scores. I downloaded llama-recipes/tools/benchmarks/llm_eval_harness/ and installed lm_evaluation_harness in that folder, then ran tools/benchmarks/llm_eval_harness/open_llm_eval_prep.sh and eval.py. How can I reproduce the results correctly?
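For context, the run described above amounts to pointing lm-evaluation-harness at the model with Open-LLM-Leaderboard-style few-shot settings. Below is a minimal, hypothetical sketch that calls the harness directly through its Python API rather than through the repo's open_llm_eval_prep.sh/eval.py wrappers; the model ID, task, and settings are assumptions, not the exact configuration those scripts generate.

```python
# Hypothetical sanity check using lm-evaluation-harness's Python API directly
# (not the repo's eval.py). Model ID, task, and few-shot count are assumptions
# and may differ from the configs that open_llm_eval_prep.sh writes out.
import lm_eval
from lm_eval.utils import make_table

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3.1-8B,dtype=bfloat16",
    tasks=["arc_challenge"],
    num_fewshot=25,   # leaderboard-style 25-shot ARC
    batch_size=8,
    limit=None,       # evaluate the full set; a small limit gives noisy scores
)
print(make_table(results))  # prints the same style of table as in the log below
```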

Error logs

2024-07-28:11:41:09,270 INFO [eval.py:85]

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| arc 25 shot | 1 | none | 25 | acc_norm | 0.3000 | ± 0.0461 |
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.0300 | ± 0.0171 |
| | | strict-match | 5 | exact_match | 0.0000 | ± 0.0000 |
| hellaswag 10 shot | 1 | none | 10 | acc_norm | 0.4200 | ± 0.0496 |
| mmlu | 1 | none | | acc | 0.2354 | ± 0.0056 |
| - humanities | 1 | none | | acc | 0.2462 | ± 0.0119 |
| - formal_logic | 0 | none | 0 | acc | 0.2300 | ± 0.0423 |
| - high_school_european_history | 0 | none | 0 | acc | 0.2800 | ± 0.0451 |
| - high_school_us_history | 0 | none | 0 | acc | 0.2400 | ± 0.0429 |
| - high_school_world_history | 0 | none | 0 | acc | 0.3000 | ± 0.0461 |
| - international_law | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - jurisprudence | 0 | none | 0 | acc | 0.2900 | ± 0.0456 |
| - logical_fallacies | 0 | none | 0 | acc | 0.1900 | ± 0.0394 |
| - moral_disputes | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - moral_scenarios | 0 | none | 0 | acc | 0.2200 | ± 0.0416 |
| - philosophy | 0 | none | 0 | acc | 0.1600 | ± 0.0368 |
| - prehistory | 0 | none | 0 | acc | 0.2300 | ± 0.0423 |
| - professional_law | 0 | none | 0 | acc | 0.2200 | ± 0.0416 |
| - world_religions | 0 | none | 0 | acc | 0.3400 | ± 0.0476 |
| - other | 1 | none | | acc | 0.2354 | ± 0.0117 |
| - business_ethics | 0 | none | 0 | acc | 0.3200 | ± 0.0469 |
| - clinical_knowledge | 0 | none | 0 | acc | 0.1400 | ± 0.0349 |
| - college_medicine | 0 | none | 0 | acc | 0.2100 | ± 0.0409 |
| - global_facts | 0 | none | 0 | acc | 0.2100 | ± 0.0409 |
| - human_aging | 0 | none | 0 | acc | 0.3100 | ± 0.0465 |
| - management | 0 | none | 0 | acc | 0.2000 | ± 0.0402 |
| - marketing | 0 | none | 0 | acc | 0.3400 | ± 0.0476 |
| - medical_genetics | 0 | none | 0 | acc | 0.2800 | ± 0.0451 |
| - miscellaneous | 0 | none | 0 | acc | 0.2100 | ± 0.0409 |
| - nutrition | 0 | none | 0 | acc | 0.2300 | ± 0.0423 |
| - professional_accounting | 0 | none | 0 | acc | 0.2100 | ± 0.0409 |
| - professional_medicine | 0 | none | 0 | acc | 0.1500 | ± 0.0359 |
| - virology | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - social sciences | 1 | none | | acc | 0.2258 | ± 0.0121 |
| - econometrics | 0 | none | 0 | acc | 0.2800 | ± 0.0451 |
| - high_school_geography | 0 | none | 0 | acc | 0.1600 | ± 0.0368 |
| - high_school_government_and_politics | 0 | none | 0 | acc | 0.1700 | ± 0.0378 |
| - high_school_macroeconomics | 0 | none | 0 | acc | 0.1600 | ± 0.0368 |
| - high_school_microeconomics | 0 | none | 0 | acc | 0.2200 | ± 0.0416 |
| - high_school_psychology | 0 | none | 0 | acc | 0.2200 | ± 0.0416 |
| - human_sexuality | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - professional_psychology | 0 | none | 0 | acc | 0.2300 | ± 0.0423 |
| - public_relations | 0 | none | 0 | acc | 0.2200 | ± 0.0416 |
| - security_studies | 0 | none | 0 | acc | 0.2300 | ± 0.0423 |
| - sociology | 0 | none | 0 | acc | 0.2900 | ± 0.0456 |
| - us_foreign_policy | 0 | none | 0 | acc | 0.2800 | ± 0.0451 |
| - stem | 1 | none | | acc | 0.2342 | ± 0.0097 |
| - abstract_algebra | 0 | none | 0 | acc | 0.2300 | ± 0.0423 |
| - anatomy | 0 | none | 0 | acc | 0.1900 | ± 0.0394 |
| - astronomy | 0 | none | 0 | acc | 0.2100 | ± 0.0409 |
| - college_biology | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - college_chemistry | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - college_computer_science | 0 | none | 0 | acc | 0.2600 | ± 0.0441 |
| - college_mathematics | 0 | none | 0 | acc | 0.2000 | ± 0.0402 |
| - college_physics | 0 | none | 0 | acc | 0.2300 | ± 0.0423 |
| - computer_security | 0 | none | 0 | acc | 0.3000 | ± 0.0461 |
| - conceptual_physics | 0 | none | 0 | acc | 0.3300 | ± 0.0473 |
| - electrical_engineering | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - elementary_mathematics | 0 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - high_school_biology | 0 | none | 0 | acc | 0.1300 | ± 0.0338 |
| - high_school_chemistry | 0 | none | 0 | acc | 0.2000 | ± 0.0402 |
| - high_school_computer_science | 0 | none | 0 | acc | 0.2900 | ± 0.0456 |
| - high_school_mathematics | 0 | none | 0 | acc | 0.2200 | ± 0.0416 |
| - high_school_physics | 0 | none | 0 | acc | 0.2000 | ± 0.0402 |
| - high_school_statistics | 0 | none | 0 | acc | 0.1600 | ± 0.0368 |
| - machine_learning | 0 | none | 0 | acc | 0.3000 | ± 0.0461 |
| truthfulqa_mc2 | 2 | none | 0 | acc | 0.5092 | ± 0.0455 |
| winogrande 5 shot | 1 | none | 5 | acc | 0.5500 | ± 0.0500 |

Expected behavior

Results comparable to those reported in the Llama 3.1 report.

wukaixingxp commented 1 month ago

Hi! Thanks for your question! We used our internal eval implementation to generate those metrics instead of relying on the public lm_evaluation_harness library. Here is a summary of our eval details, and we also published the evaluation result details as datasets in the Llama 3.1 Evals Hugging Face collection for you to review. Meanwhile, we are working to see whether we can use code from lm_evaluation_harness to load our eval-detail datasets and reproduce our results. Please stay tuned.
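For anyone who just wants to inspect those published eval details, they can be pulled down like any other Hugging Face dataset. A rough sketch follows; the repo id, config name, and split below are assumptions based on the Llama 3.1 Evals collection, so check the collection page for the exact names (and accept the gating terms first).

```python
# Rough sketch: loading one of the published Llama 3.1 eval-detail datasets.
# The repo id, config name, and split are assumptions; verify them against the
# Llama 3.1 Evals collection on Hugging Face before running.
from datasets import load_dataset

ds = load_dataset(
    "meta-llama/Meta-Llama-3.1-8B-Instruct-evals",           # assumed repo id
    name="Meta-Llama-3.1-8B-Instruct-evals__mmlu__details",  # assumed config name
    split="latest",                                          # assumed split name
)
print(ds.column_names)  # per-example prompts, model outputs, and scores
print(ds[0])            # inspect a single evaluated example
```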

Juhywcy commented 1 month ago

Thanks for your reply! Is there any code to reproduce the evaluation results? I want to test my own method.

Juhywcy commented 1 month ago

I would very much like to reproduce your results, which are very important for our research. Thanks.

wukaixingxp commented 1 month ago

Hi! We published details for 42 tasks in this release. Are there particular tasks you want to reproduce for your research?

Juhywcy commented 1 month ago

Yes, I want to reproduce the tasks from the Llama 3.1 report, but so far I have only found the evaluation logs at the URL you gave me. Could you share some code to reproduce the results? Thank you very much.

init27 commented 2 weeks ago

@Juhywcy please let us know if the PR above / the latest code doesn't answer your questions. Thanks for raising the issue!

wukaixingxp commented 2 weeks ago

@Juhywcy We have developed an eval-reproduction recipe that runs our published Llama 3.1 Evals Hugging Face datasets with lm-evaluation-harness. Please take a look; hopefully it will be helpful to you.
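For reference, a directory of custom task definitions like the one such a recipe provides can be registered with lm-evaluation-harness through its TaskManager. The sketch below is hypothetical: the include path and task name are placeholders, not the recipe's actual files.

```python
# Hypothetical sketch of plugging a directory of custom task YAMLs (such as the
# ones a reproduce recipe ships) into lm-evaluation-harness. The include path
# and task name are placeholders, not the recipe's actual file or task names.
import lm_eval
from lm_eval.tasks import TaskManager
from lm_eval.utils import make_table

task_manager = TaskManager(include_path="./meta_eval_tasks")  # placeholder dir

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16",
    tasks=["meta_mmlu"],        # placeholder task name defined by the YAMLs above
    task_manager=task_manager,
    batch_size=8,
)
print(make_table(results))
```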