OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.
Motivation

This PR introduces the implementation of P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs (see paper link). P-MMEval provides parallel examples in 10 languages across multiple tasks, enabling consistent evaluation of LLMs' multilingual capabilities.
Modification
Configs:
Add files in configs/datasets/PMMEval for evaluation support. A dedicated dataset Python config file is created for each subset of P-MMEval (i.e., flores, humaneval-xl, mgsm, mhellaswag, mifeval, mlogiqa, mmmlu, and xnli); a sketch of such a file is shown after this list.
Add files in configs/summarizers and configs/summarizers/groups for summarizing the evaluation results on P-MMEval.
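For reference, below is a minimal sketch of what one of these per-subset config files might look like, following the usual OpenCompass config conventions. The class name `PMMEvalMGSMDataset`, the `path` value, and the column names are illustrative placeholders, not the exact identifiers introduced by this PR:

```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator

# Hypothetical loader class; the actual class added in this PR may differ.
from opencompass.datasets import PMMEvalMGSMDataset

# Column names are placeholders for illustration.
pmmeval_mgsm_reader_cfg = dict(input_columns=['question'], output_column='answer')

pmmeval_mgsm_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(round=[dict(role='HUMAN', prompt='{question}')])),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer))

pmmeval_mgsm_eval_cfg = dict(evaluator=dict(type=AccEvaluator))

pmmeval_mgsm_datasets = [
    dict(
        abbr='pmmeval_mgsm',
        type=PMMEvalMGSMDataset,
        path='data/PMMEval',  # placeholder data path
        reader_cfg=pmmeval_mgsm_reader_cfg,
        infer_cfg=pmmeval_mgsm_infer_cfg,
        eval_cfg=pmmeval_mgsm_eval_cfg)
]
```

The files under configs/summarizers and configs/summarizers/groups then collect these per-subset abbreviations so that results can be reported as a single P-MMEval group.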
Datasets
Add files under datasets to support loading and evaluating each subset; a rough loader sketch follows below.
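A sketch of how such a loader could be registered, following the existing OpenCompass dataset pattern (a BaseDataset subclass registered via LOAD_DATASET). The class name, on-disk file layout, and the `lang` argument are assumptions for illustration, not the exact interface added by this PR:

```python
import json
import os

from datasets import Dataset

from opencompass.datasets.base import BaseDataset
from opencompass.registry import LOAD_DATASET


@LOAD_DATASET.register_module()
class PMMEvalMGSMDataset(BaseDataset):  # hypothetical name for illustration
    """Load one language split of the parallel mgsm subset of P-MMEval."""

    @staticmethod
    def load(path: str, lang: str = 'en') -> Dataset:
        # Assumed layout: <path>/mgsm/test-<lang>.jsonl, one JSON example per line.
        file_path = os.path.join(path, 'mgsm', f'test-{lang}.jsonl')
        examples = []
        with open(file_path, encoding='utf-8') as f:
            for line in f:
                examples.append(json.loads(line))
        return Dataset.from_list(examples)
```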
Checklist
Before PR:
[x] Pre-commit or other linting tools are used to fix the potential lint issues.
[ ] Bug fixes are fully covered by unit tests; the case that causes the bug should be added to the unit tests.
[ ] The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
[ ] The documentation has been modified accordingly, like docstring or example tutorials.
After PR:
[ ] If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
[ ] CLA has been signed and all committers have signed the CLA in this PR.