AGIEval

This repository contains the AGIEval benchmark: a description of it, its data, the code, and the outputs of the baseline systems.

Introduction

AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving. This benchmark is derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers, such as general college admission tests (e.g., Chinese College Entrance Exam (Gaokao) and American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams. For a full description of the benchmark, please refer to our paper: AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models.

Tasks and Data

The dataset has been updated to version 1.1. This version adds questions from 2023 to the Chinese Gaokao (chemistry, biology, physics) datasets and fixes annotation issues. To simplify evaluation, every multiple-choice question (MCQ) task now has exactly one correct answer (Gaokao-Physics and JEC-QA previously had multi-label answers). The AGIEval-en datasets are unchanged from version 1.0. The statistics of the new version are as follows:

AGIEval v1.1 contains 20 tasks: 18 MCQ tasks and two cloze tasks (Gaokao-Math-Cloze and MATH). The full list of tasks is in the table below.

[Table image: The datasets used in AGIEval]

You can download all post-processed data in the data/v1_1 folder. All usage of the data should follow the license of the original datasets.

The data format for all datasets is as follows:

{
    "passage": null,
    "question": "设集合 $A=\\{x \\mid x \\geq 1\\}, B=\\{x \\mid-1<x<2\\}$, 则 $A \\cap B=$ ($\\quad$)\\\\\n",
    "options": ["(A)$\\{x \\mid x>-1\\}$", 
        "(B)$\\{x \\mid x \\geq 1\\}$", 
        "(C)$\\{x \\mid-1<x<1\\}$", 
        "(D)$\\{x \\mid 1 \\leq x<2\\}$"
        ],
    "label": "D",
    "answer": null
}

The passage field is populated for gaokao-chinese, gaokao-english, both logiqa tasks, all LSAT tasks, and SAT. For MCQ tasks, the answer is stored in the label field; for cloze tasks, it is stored in the answer field.
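As a minimal sketch of how a record in this format might be read, the Python snippet below parses one record and picks the reference answer from the appropriate field. The sample record is shortened and illustrative, not taken from the dataset; the helper names gold_answer and render are hypothetical; and the snippet assumes that cloze records leave label as null, which the description above implies but does not state.

```python
import json

# Illustrative record in the format shown above (not taken from the dataset).
raw = (
    '{"passage": null, "question": "1 + 1 = ?", '
    '"options": ["(A)1", "(B)2", "(C)3", "(D)4"], '
    '"label": "B", "answer": null}'
)


def gold_answer(record):
    """Return the reference answer: `label` for MCQ tasks, `answer` for cloze tasks.

    Assumption: cloze records leave `label` as null.
    """
    if record.get("label") is not None:
        return record["label"]
    return record.get("answer")


def render(record):
    """Join the passage (if present), the question, and the options into one text block."""
    parts = []
    if record.get("passage"):
        parts.append(record["passage"])
    parts.append(record["question"])
    parts.extend(record.get("options") or [])
    return "\n".join(parts)


record = json.loads(raw)
print(render(record))
print("gold:", gold_answer(record))
```

The same two helpers would apply to each record in a data file, whatever its on-disk layout turns out to be.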

We provide the prompts for few-shot learning in the data/few_shot_prompts file.

Baseline Systems

We evaluate the performance of the baseline systems (gpt-3.5-turbo and GPT-4o) on AGIEval v1.1. The results are as follows:

[Image: The datasets used in AGIEval]

You can replicate the results by following the steps below:

1. Update your OpenAI API settings in the openai_api.py file.
2. Run the run_prediction.py script to get the results.

Evaluation

You can run the post_process_and_evaluation.py file to get the evaluation results.
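For orientation, the sketch below shows one simple way MCQ accuracy could be computed once each model output has been reduced to a single option letter. It is not the logic of post_process_and_evaluation.py, which is not shown here; the function names normalize and accuracy and the sample predictions are hypothetical.

```python
def normalize(prediction):
    """Strip whitespace and surrounding parentheses, then upper-case: ' (d) ' -> 'D'."""
    return prediction.strip().strip("()").upper()


def accuracy(predictions, labels):
    """Fraction of predictions whose normalized letter equals the gold label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(normalize(p) == gold for p, gold in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical predictions and gold labels for three questions.
preds = ["D", " (a) ", "B"]
golds = ["D", "A", "C"]
print(f"accuracy = {accuracy(preds, golds):.3f}")
```

Cloze tasks would need a different comparison (for example, matching a normalized numeric or symbolic answer), which this sketch does not attempt.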

Leaderboard

We report the leaderboard on AGIEval v1.1. It covers two subsets, AGIEval-en and AGIEval-zh, both of which contain only MCQ tasks. The leaderboard is as follows:

AGIEval-en few-shot

Model	Source	Average
GPT-4o	Link	71.4
Llama 3 400B+	Link	69.9
Llama 3 70B	Link	63
Mixtral 8x22B	Link	61.2
GPT-3.5-Turbo	Link	52.7
Llama 3 8B	Link	45.9
Gemma 7B	Link	44.9
Mistral 7B	Link	44

AGIEval-zh few-shot

Model	Source	Average
GPT-4o	Link	71.9
GPT-3.5-Turbo	Link	49.5

AGIEval-all few-shot

Model	Source	Average
GPT-4o	Link	69.0
GPT-3.5-Turbo	Link	47.2

AGIEval-en zero-shot

Model	Source	Average
GPT-4o	Link	65.2
GPT-3.5-Turbo	Link	54.1

AGIEval-zh zero-shot

Model	Source	Average
GPT-4o	Link	63.3
GPT-3.5-Turbo	Link	45.0

AGIEval-all zero-shot

(An asterisk indicates a result reported on AGIEval v1.0.)

Model	Source	Average
GPT-4o	Link	62.3
InternLM2-20B*	Link	53.0
Qwen-14B*	Link	52.0
Phi-3-medium 14b*	Link	50.2
InternLM2-Chat-7B-SFT*	Link	49.0
GPT-3.5-Turbo	Link	46.0
Qwen-7B*	Link	45.6
Mixtral 8x7b*	Link	45.2
Phi-3-small 7b*	Link	45.1
Gemma 7b*	Link	42.1
Llama-3-In*	Link	42.0
Phi-3-mini 3.8b*	Link	37.5
Mistral 7b*	Link	35.1
Phi-2 2.7b*	Link	29.8

Citation

If you use the AGIEval benchmark or the code in your research, please cite our paper:

@misc{zhong2023agieval,
      title={AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models}, 
      author={Wanjun Zhong and Ruixiang Cui and Yiduo Guo and Yaobo Liang and Shuai Lu and Yanlin Wang and Amin Saied and Weizhu Chen and Nan Duan},
      year={2023},
      eprint={2304.06364},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
