This repository contains information about AGIEval, along with the data, code, and baseline system outputs for the benchmark.
AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving. This benchmark is derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers, such as general college admission tests (e.g., Chinese College Entrance Exam (Gaokao) and American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams. For a full description of the benchmark, please refer to our paper: AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models.
We have updated the dataset to version 1.1. The new version updates the Chinese Gaokao (chemistry, biology, physics) datasets with questions from 2023 and addresses annotation issues. To facilitate evaluation, all multi-choice question (MCQ) tasks now have exactly one answer (Gaokao-Physics and JEC-QA used to have multi-label answers). The AGIEval-en datasets remain the same as in version 1.0. The new version's statistics are as follows:
AGIEval v1.1 contains 20 tasks, including 18 MCQ tasks and two cloze tasks (Gaokao-Math-Cloze and MATH). You can find the full list of tasks in the table below.
You can download all post-processed data in the data/v1_1 folder. All usage of the data should follow the license of the original datasets.
The data format for all datasets is as follows:
{
"passage": null,
"question": "设集合 $A=\\{x \\mid x \\geq 1\\}, B=\\{x \\mid-1<x<2\\}$, 则 $A \\cap B=$ ($\\quad$)\\\\\n",
"options": ["(A)$\\{x \\mid x>-1\\}$",
"(B)$\\{x \\mid x \\geq 1\\}$",
"(C)$\\{x \\mid-1<x<1\\}$",
"(D)$\\{x \\mid 1 \\leq x<2\\}$"
],
"label": "D",
"answer": null
}
The `passage` field is available for gaokao-chinese, gaokao-english, both of logiqa, all of LSAT, and SAT. The answer for multi-choice tasks is saved in the `label` field. The answer for cloze tasks is saved in the `answer` field.
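As a minimal sketch of how the fields described above might be read, the snippet below parses records and picks the reference answer from `label` (MCQ) or `answer` (cloze). It assumes one JSON object per line; the helper names (`load_records`, `gold_answer`, `has_passage`) and the two sample records are made up for illustration and are not part of the repository.

```python
import json


def load_records(lines):
    """Parse an iterable of JSON strings, assuming one record per line."""
    return [json.loads(line) for line in lines if line.strip()]


def gold_answer(record):
    """Return the reference answer for one record.

    MCQ records store the answer in "label"; cloze records store it
    in "answer". The other field is expected to be null (None).
    """
    if record.get("label") is not None:
        return record["label"]
    return record.get("answer")


def has_passage(record):
    """True if the record comes with a reading passage."""
    return record.get("passage") is not None


# Hypothetical records that only mimic the documented field layout.
sample_lines = [
    '{"passage": null, "question": "1 + 1 = ( )", '
    '"options": ["(A)1", "(B)2"], "label": "B", "answer": null}',
    '{"passage": null, "question": "2 + 3 = ____", '
    '"options": null, "label": null, "answer": "5"}',
]

records = load_records(sample_lines)
for record in records:
    print(gold_answer(record), has_passage(record))
```

To read a real file, the same `load_records` helper could be given an open file handle instead of the in-memory list, provided the file does hold one record per line.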
We provide the prompts for few-shot learning in the data/few_shot_prompts file.
We evaluate the performance of the baseline systems (gpt-3.5-turbo and GPT-4o) on AGIEval v1.1. The results are as follows:
You can replicate the results by following the steps below:
You can run the post_process_and_evaluation.py file to get the evaluation results.
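For orientation only, MCQ scoring of this kind can be sketched as: pull a choice letter out of the model output, then compare it with the `label` field. The snippet below is a naive illustration and not the logic of post_process_and_evaluation.py; the function names, the A-E option range, and the "last standalone letter" heuristic are all assumptions.

```python
import re


def extract_choice(model_output):
    """Return the last standalone capital letter A-E in the output, or None.

    This is a deliberately simple heuristic (an assumption of this sketch);
    real post-processing is likely to be more careful.
    """
    matches = re.findall(r"\b([A-E])\b", model_output)
    return matches[-1] if matches else None


def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(1 for p, g in zip(predictions, labels) if p == g)
    return correct / len(labels)


# Hypothetical model outputs and gold labels.
outputs = ["The answer is (D).", "I would pick B", "no idea"]
gold = ["D", "C", "A"]

predicted = [extract_choice(text) for text in outputs]
print(predicted, accuracy(predicted, gold))
```

An output with no recognizable letter yields `None`, which never matches a label and so counts as incorrect.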
We report the leaderboard on AGIEval v1.1. The leaderboard contains two subsets, AGIEval-en and AGIEval-zh, both of which contain only MCQ tasks. The leaderboard is as follows:
Model | Source | Average |
---|---|---|
GPT-4o | Link | 71.4 |
Llama 3 400B+ | Link | 69.9 |
Llama 3 70B | Link | 63 |
Mixtral 8x22B | Link | 61.2 |
GPT-3.5-Turbo | Link | 52.7 |
Llama 3 8B | Link | 45.9 |
Gemma 7B | Link | 44.9 |
Mistral 7B | Link | 44 |
Model | Source | Average |
---|---|---|
GPT-4o | Link | 71.9 |
GPT-3.5-Turbo | Link | 49.5 |
Model | Source | Average |
---|---|---|
GPT-4o | Link | 69.0 |
GPT-3.5-Turbo | Link | 47.2 |
Model | Source | Average |
---|---|---|
GPT-4o | Link | 65.2 |
GPT-3.5-Turbo | Link | 54.1 |
Model | Source | Average |
---|---|---|
GPT-4o | Link | 63.3 |
GPT-3.5-Turbo | Link | 45.0 |
(An asterisk indicates results reported on AGIEval v1.0.)
Model | Source | Average |
---|---|---|
GPT-4o | Link | 62.3 |
InternLM2-20B* | Link | 53.0 |
Qwen-14B* | Link | 52.0 |
Phi-3-medium 14b* | Link | 50.2 |
InternLM2-Chat-7B-SFT* | Link | 49.0 |
GPT-3.5-Turbo | Link | 46.0 |
Qwen-7B* | Link | 45.6 |
Mixtral 8x7b* | Link | 45.2 |
Phi-3-small 7b* | Link | 45.1 |
Gemma 7b* | Link | 42.1 |
Llama-3-In* | Link | 42.0 |
Phi-3-mini 3.8b* | Link | 37.5 |
Mistral 7b* | Link | 35.1 |
Phi-2 2.7b* | Link | 29.8 |
If you use the AGIEval benchmark or the code in your research, please cite our paper:
@misc{zhong2023agieval,
title={AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models},
author={Wanjun Zhong and Ruixiang Cui and Yiduo Guo and Yaobo Liang and Shuai Lu and Yanlin Wang and Amin Saied and Weizhu Chen and Nan Duan},
year={2023},
eprint={2304.06364},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.