Closed Juhywcy closed 2 weeks ago
Hi! Thanks for your question! We used our internal eval implementation to generate those metrics instead of relying on the public lm_evaluation_harness library. Here is a summary of our eval details, and we have also published the evaluation result details as datasets in the Llama 3.1 Evals Hugging Face collections for you to review. Meanwhile, we are working to see if we can use code from lm_evaluation_harness to load our eval detail datasets and reproduce our results. Please stay tuned.
Thanks for your reply! Is there any code to reproduce the evaluation results? I want to test my method.
I would very much like to reproduce your results, which are very important for our research. Thanks.
Hi! We published details for 42 tasks in this release. Are there particular tasks you want to reproduce for your research?
Yeah, I want to reproduce the tasks from the Llama 3.1 report, and I found the evaluation logs at the URL you gave me. Can you share some code to reproduce the results? Thank you very much.
@Juhywcy please let us know if the PR above / latest code doesn't answer your questions. Thanks for raising the issue!
@Juhywcy We have developed an eval reproduction recipe to run our published Llama 3.1 evals Hugging Face datasets with lm-evaluation-harness. Please take a look; hopefully it is helpful to you.
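For readers landing here, a minimal lm-evaluation-harness invocation looks like the sketch below. The model ID, task name, and output path are illustrative assumptions only; for the actual task configs and prompts that match the published Llama 3.1 eval-detail datasets, follow the reproduction recipe linked above rather than the stock harness tasks.

```shell
# Install the public evaluation harness (assumes Python >= 3.9).
pip install lm-eval

# Illustrative run: evaluate a Hugging Face model on a stock harness task.
# Note: stock task configs may differ from Meta's internal eval setup
# (prompt format, few-shot selection, parsing), so scores can differ
# from the report unless you use the recipe's task configs.
lm_eval \
  --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tasks mmlu \
  --batch_size 8 \
  --output_path ./eval_results
```

The key caveat, which explains most score gaps people see: the harness's default prompt templates and answer-extraction rules are not guaranteed to match the ones used to produce the reported numbers, which is exactly why the reproduction recipe ships its own task configurations.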
System Info
[pip3] numpy==1.26.3
[pip3] torch==2.3.1+cu121
[pip3] torchaudio==2.3.1+cu121
[pip3] torchvision==0.18.1+cu121
[pip3] triton==2.3.1
[conda] numpy 1.26.3 pypi_0 pypi
[conda] torch 2.3.1+cu121 pypi_0 pypi
[conda] torchaudio 2.3.1+cu121 pypi_0 pypi
[conda] torchvision 0.18.1+cu121 pypi_0 pypi
[conda] triton 2.3.1 pypi_0 pypi
Information
🐛 Describe the bug
I followed the README, but I get low scores. I downloaded llama-recipes/tools/benchmarks/llm_eval_harness/ and installed lm_evaluation_harness in that folder. Then I ran tools/benchmarks/llm_eval_harness/open_llm_eval_prep.sh and eval.py. How do I reproduce the results correctly?
Error logs
Expected behavior
The results should match those in the Llama 3.1 report.