The papers are organized according to our survey:
Evaluating Large Language Models: A Comprehensive Survey
Zishan Guo*, Renren Jin*, Chuang Liu*, Yufei Huang, Dan Shi, Supryadi,
Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong†
Tianjin University
(*: Co-first authors, †: Corresponding author)
If you find our survey useful, please kindly cite our paper:
@article{guo2023evaluating,
title={Evaluating Large Language Models: A Comprehensive Survey},
author={Guo, Zishan and Jin, Renren and Liu, Chuang and Huang, Yufei and Shi, Dan and Yu, Linhao and Liu, Yan and Li, Jiaxuan and Xiong, Bojian and Xiong, Deyi and others},
journal={arXiv preprint arXiv:2310.19736},
year={2023}
}
Feel free to open an issue/PR or e-mail guozishan@tju.edu.cn, rrjin@tju.edu.cn, liuc_09@tju.edu.cn and dyxiong@tju.edu.cn if you find any missing areas, papers, or datasets. We will keep updating this list and survey.
Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, akin to a double-edged sword, LLMs also present potential risks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of LLMs.
This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation, and safety evaluation. In addition to the comprehensive review of the evaluation methodologies and benchmarks for these three aspects, we collate a compendium of evaluations pertaining to LLMs' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations on capabilities, alignment, safety, and applicability.
We hope that this comprehensive overview will stimulate further research interest in the evaluation of LLMs, with the ultimate goal of making evaluation serve as a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution into a direction that maximizes societal benefit while minimizing potential risks.
The paper proposes a dataset that can be used for LLM evaluation.
The paper proposes an evaluation method that can be used for LLMs.
The paper proposes a platform for LLM evaluation.
The paper examines the performance of LLMs in a particular domain.
"Through the Lens of Core Competency: Survey on Evaluation of Large Language Models".
"A Survey on Evaluation of Large Language Models".
Yupeng Chang and Xu Wang et al. arXiv 2023. [Paper] [GitHub]
SQuAD: "SQuAD: 100,000+ Questions for Machine Comprehension of Text".
NarrativeQA: "The NarrativeQA Reading Comprehension Challenge".
HotpotQA: "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering".
CoQA: "CoQA: A Conversational Question Answering Challenge".
NQ: "Natural Questions: A Benchmark for Question Answering Research".
DuReader: "DuReader_robust: A Chinese Dataset Towards Evaluating Robustness and Generalization of Machine Reading Comprehension in Real-World Applications".
RAGAS: "RAGAS: Automated Evaluation of Retrieval Augmented Generation".
"Why Does ChatGPT Fall Short in Providing Truthful Answers?".
Shen Zheng and Jie Huang et al. arXiv 2023. [Paper]
LAMA: "Language Models as Knowledge Bases?".
KoLA: "KoLA: Carefully Benchmarking World Knowledge of Large Language Models".
WikiFact: "Assessing the Factual Accuracy of Generated Text".
Ben Goodrich et al. KDD 2019. [Paper]
ARC: "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge".
QASC: "QASC: A Dataset for Question Answering via Sentence Composition".
MCTACO: "“Going on a vacation” takes longer than “Going for a walk”: A Study of Temporal Commonsense Understanding".
TRACIE: "Temporal Reasoning on Implicit Events from Distant Supervision".
TIMEDIAL: "TIMEDIAL: Temporal Commonsense Reasoning in Dialog".
HellaSwag: "HellaSwag: Can a Machine Really Finish Your Sentence?".
PIQA: "PIQA: Reasoning about Physical Commonsense in Natural Language".
Pep-3k: "Modeling Semantic Plausibility by Injecting World Knowledge".
Social IQA: "Social IQa: Commonsense Reasoning about Social Interactions".
Maarten Sap and Hannah Rashkin et al. EMNLP 2019. [Paper] [Source]
CommonsenseQA: "CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge".
Alon Talmor and Jonathan Herzig et al. NAACL 2019. [Paper] [GitHub]
OpenBookQA: "Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering".
"A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity".
"ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models".
Ning Bian et al. arXiv 2023. [Paper]
SNLI: "A large annotated corpus for learning natural language inference".
Samuel R. Bowman et al. EMNLP 2015. [Paper]
MultiNLI: "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference".
LogicNLI: "Diagnosing the First-Order Logical Reasoning Ability Through LogicNLI".
Jidong Tian and Yitian Li et al. EMNLP 2021. [Paper]
ConTRoL: "Natural Language Inference in Context — Investigating Contextual Reasoning over Long Texts".
MED: "Can Neural Networks Understand Monotonicity Reasoning?".
Hitomi Yanaka et al. ACL Workshop BlackboxNLP 2019. [Paper] [GitHub]
HELP: "HELP: A Dataset for Identifying Shortcomings of Neural Models in Monotonicity Reasoning".
ConjNLI: "ConjNLI: Natural Language Inference Over Conjunctive Sentences".
TaxiNLI: "TaxiNLI: Taking a Ride up the NLU Hill".
Pratik Joshi, Somak Aditya and Aalok Sathe et al. CoNLL 2020. [Paper] [GitHub]
ReClor: "ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning".
Weihao Yu and Zihang Jiang et al. ICLR 2020. [Paper] [Source]
LogiQA: "LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning".
LogiQA 2.0: "LogiQA 2.0 — An Improved Dataset for Logical Reasoning in Natural Language Understanding".
LSAT: "From LSAT: The Progress and Challenges of Complex Reasoning".
Siyuan Wang et al. TASLP 2021. [Paper]
LogicInference: "LogicInference: A New Dataset for Teaching Logical Inference to seq2seq Models".
Santiago Ontanon et al. ICLR OSC workshop 2022. [Paper] [GitHub]
FOLIO: "FOLIO: Natural Language Reasoning with First-Order Logic".
"Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond".
"A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity".
"Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4".
HotpotQA: "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering".
Zhilin Yang, Peng Qi and Saizheng Zhang et al. EMNLP 2018. [Paper] [GitHub]
HybridQA: "HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data".
MultiRC: "Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences".
NarrativeQA: "The NarrativeQA Reading Comprehension Challenge".
Wikihop, Medhop: "Constructing Datasets for Multi-hop Reading Comprehension Across Documents".
"A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity".
"How is ChatGPT's behavior changing over time?".
MultiArith: "Solving General Arithmetic Word Problems".
Subhro Roy and Dan Roth et al. EMNLP 2015. [Paper]
AddSub: "Learning to Solve Arithmetic Word Problems with Verb Categorization".
Mohammad Javad Hosseini et al. ACL 2014. [Paper]
AQUA: "Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems".
Wang Ling et al. ACL 2017. [Paper]
SVAMP: "Are NLP Models Really Able to Solve Simple Math Word Problems?".
GSM8K: "Training Verifiers to Solve Math Word Problems".
M3KE: "M3KE: A Massive Multi-level Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models".
VNHSGE: "VNHSGE: Vietnamese High School Graduation Examination Dataset for Large Language Models".
MATH: "Measuring Mathematical Problem Solving with the MATH Dataset".
JEEBench: "Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark for Large Language Models".
MATH401: "How Well Do Large Language Models Perform in Arithmetic Tasks?".
CMATH: "CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?".
WeiTian Wen et al. arXiv 2023. [Paper]
"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models".
Jason Wei et al. NeurIPS 2022. [Paper]
"Evaluating Language Models for Mathematics Through Interactions".
Katherine M. Collins et al. arXiv 2023. [Paper]
RestBench: "RestGPT: Connecting Large Language Models with Real-World RESTful APIs".
SayCan: "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances".
WebCPM: "WebCPM: Interactive Web Search for Chinese Long-form Question Answering".
WebShop: "WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents".
ToolAlpaca: "ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases".
"Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models".
Cheng-Yu Hsieh et al. arXiv 2023. [Paper]
ToolQA: "ToolQA: A Dataset for LLM Question Answering with External Tools".
Toolformer: "Toolformer: Language Models Can Teach Themselves to Use Tools".
ALFRED: "ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks".
ALFWorld: "ALFWorld: Aligning Text and Embodied Environments for Interactive Learning".
BEHAVIOR: "BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments".
Inner Monologue: "Inner Monologue: Embodied Reasoning through Planning with Language Models".
API-Bank: "API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs".
"On the Tool Manipulation Capability of Open-source Large Language Models".
Qiantong Xu et al. arXiv 2023. [Paper]
"Tool Learning with Foundation Models".
ToolEval: "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs".
LaMDA: "LaMDA: Language Models for Dialog Applications".
GeneGPT: "GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information".
Code as Policies: "Code as Policies: Language Model Programs for Embodied Control".
"Augmented Language Models: a Survey".
Grégoire Mialon et al. arXiv 2023. [Paper]
"Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly".
"UnCommonSense: Informative Negative Knowledge about Everyday Concepts".
"Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models".
Yuhui Zhang and Michihiro Yasunaga et al. ACL (Findings) 2023. [Paper] [GitHub]
"Say What You Mean! Large Language Models Speak Too Positively about Negative Commonsense Knowledge".
ScoNe: "ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning".
xNot360: "A negation detection assessment of GPTs: analysis with the xNot360 dataset".
"This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models".
Iker García-Ferrero et al. EMNLP 2023. [Paper] [GitHub] [Source]
"Classification of moral foundations in microblog political discourse".
Kristen Johnson et al. ACL 2018. [Paper]
Social chemistry 101: "Social chemistry 101: Learning to reason about social and moral norms".
Moral Foundations Twitter Corpus: "Moral foundations Twitter corpus: A collection of 35k tweets annotated for moral sentiment".
Joe Hoover et al. [Paper]
"Moral stories: Situated reasoning about norms, intents, actions, and their consequences".
"Analysis of moral judgement on Reddit".
Nicholas Botzer et al. CoRR 2021. [Paper]
MIC: "The moral integrity corpus: A benchmark for ethical dialogue systems".
"When to make exceptions: Exploring language models as accounts of human moral judgment".
"ProsocialDialog: A prosocial backbone for conversational agents".
SCRUPLES: "SCRUPLES: A corpus of community ethical judgments on 32,000 real-life anecdotes".
"TrustGPT: A benchmark for trustworthy and responsible large language models".
"Aligning AI with shared human values".
"Evaluating the moral beliefs encoded in LLMs".
Winogender: "Gender Bias in Coreference Resolution".
WinoBias: "Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods".
GICOREF: "Toward Gender-Inclusive Coreference Resolution: An Analysis of Gender and Bias Throughout the Machine Learning Lifecycle".
Yang Trista Cao et al. Comput. Linguistics 2021. [Paper]
WinoMT: "Evaluating Gender Bias in Machine Translation".
"Investigating Failures of Automatic Translation in the Case of Unambiguous Gender".
Adithya Renduchintala et al. ACL 2022. [Paper]
"Addressing Age-Related Bias in Sentiment Analysis".
EEC: "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems".
Svetlana Kiritchenko et al. NAACL HLT 2018. [Paper] [Source]
WikiGenderBias: "Towards Understanding Gender Bias in Relation Extraction".
"Measuring and Mitigating Unintended Bias in Text Classification".
"Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification".
Daniel Borkan et al. WWW 2019. [Paper]
"Social Bias Frames: Reasoning about Social and Power Implications of Language".
"Finding Microaggressions in the Wild: A Case for Locating Elusive Phenomena in Social Media Posts".
Luke Breitfeller et al. EMNLP-IJCNLP 2019. [Paper]
Latent Hatred: "Latent Hatred: A Benchmark for Understanding Implicit Hate Speech".
DynaHate: "Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection".
TOXIGEN: "ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection".
Thomas Hartvigsen et al. ACL 2022. [Paper] [GitHub] [Source]
CDail-Bias: "Towards Identifying Social Bias in Dialog Systems: Frame, Datasets, and Benchmarks".
CORGI-PM: "CORGI-PM: A Chinese Corpus For Gender Bias Probing and Mitigation".
HateCheck: "HateCheck: Functional Tests for Hate Speech Detection Models".
StereoSet: "StereoSet: Measuring stereotypical bias in pretrained language models".
Moin Nadeem et al. ACL/IJCNLP 2021. [Paper] [GitHub] [Source]
CrowS-Pairs: "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models".
"Does Gender Matter? Towards Fairness in Dialogue Systems".
BOLD: "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation".
HolisticBias: "“I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset".
Multilingual Holistic Bias: "Multilingual Holistic Bias: Extending Descriptors and Patterns to Unveil Demographic Biases in Languages at Scale".
Eric Michael Smith et al. arXiv 2023. [Paper]
Unqover: "UNQOVERing Stereotyping Biases via Underspecified Questions".
BBQ: "BBQ: A Hand-Built Bias Benchmark for Question Answering".
CBBQ: "CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models".
"Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer".
FairLex: "FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing".
"Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification".
Daniel Borkan et al. WWW 2019. [Paper]
"On measuring and mitigating biased inferences of word embeddings".
Sunipa Dev et al. AAAI 2020. [Paper]
"An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models".
"Revealing Persona Biases in Dialogue Systems".
"On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?".
Emily M. Bender et al. FAccT 2021. [Paper]
"A Survey on Hate Speech Detection using Natural Language Processing".
Anna Schmidt et al. SocialNLP 2017. [Paper]
"Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity".
Terry Yue Zhuo et al. arXiv 2023. [Paper]
OLID: "Predicting the Type and Target of Offensive Posts in Social Media".
Marcos Zampieri et al. NAACL-HLT 2019. [Paper]
SOLID: "SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification".
Sara Rosenthal et al. ACL/IJCNLP (Findings) 2021. [Paper] [Source]
OLID-BR: "OLID-BR: offensive language identification dataset for Brazilian Portuguese".
KODOLI: "“Why do I feel offended?” - Korean Dataset for Offensive Language Identification".
RealToxicityPrompts: "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models".
HarmfulQ: "On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning".
"Toxicity in ChatGPT: Analyzing Persona-assigned Language Models".
Ameet Deshpande et al. arXiv 2023. [Paper]
"Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity".
Terry Yue Zhuo et al. arXiv 2023. [Paper]
NewsQA: "NewsQA: A Machine Comprehension Dataset".
Adam Trischler, Tong Wang, and Xingdi Yuan et al. Rep4NLP@ACL 2017. [Paper] [GitHub]
SQuAD 2.0: "Know What You Don't Know: Unanswerable Questions for SQuAD".
Pranav Rajpurkar and Robin Jia et al. ACL 2018. [Paper] [Source]
BIG-bench: "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models".
SelfAware: "Do Large Language Models Know What They Don’t Know?".
TruthfulQA: "TruthfulQA: Measuring How Models Mimic Human Falsehoods".
HalluQA: "Evaluating Hallucinations in Chinese Large Language Models".
DialFact: "DialFact: A Benchmark for Fact-Checking in Dialogue".
"Q2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering".
BEGIN: "Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark".
Nouha Dziri and Hannah Rashkin et al. TACL 2022. [Paper] [GitHub]
ConsisTest: "What Was Your Name Again? Interrogating Generative Conversational Models For Factual Consistency Evaluation".
XSumFaith: "On Faithfulness and Factuality in Abstractive Summarization".
Joshua Maynez and Shashi Narayan et al. ACL 2020. [Paper] [GitHub]
FactCC: "Evaluating the Factual Consistency of Abstractive Text Summarization".
SummEval: "SummEval: Re-evaluating Summarization Evaluation".
Alexander R. Fabbri and Wojciech Kryściński et al. TACL 2021. [Paper] [GitHub]
FRANK: "Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics".
SummaC: "SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization".
"Asking and Answering Questions to Evaluate the Factual Consistency of Summaries".
"Annotating and Modeling Fine-grained Factuality in Summarization".
"Hallucinated but Factual! Inspecting the Factuality of Hallucinations in Abstractive Summarization".
CLIFF: "CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization".
AggreFact: "Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors".
PolyTope: "What Have We Achieved on Text Summarization?".
Dandan Huang and Leyang Cui et al. EMNLP 2020. [Paper] [GitHub]
FIB: "Evaluating the Factual Consistency of Large Language Models Through News Summarization".
FacTool: "FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios".
CONNER: "Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators".
FActScore: "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation".
SelfCheckGPT: "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models".
SAPLMA: "The Internal State of an LLM Knows When It's Lying".
Amos Azaria et al. arXiv 2023. [Paper]
"Teaching Models to Express Their Uncertainty in Words".
Stephanie Lin et al. arXiv 2022. [Paper]
"Language Models (Mostly) Know What They Know".
Saurav Kadavath et al. arXiv 2022. [Paper]
"Dialogue Natural Language Inference".
Sean Welleck et al. ACL 2019. [Paper]
"Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference".
Tobias Falke et al. ACL 2019. [Paper]
"mFACE: Multilingual Summarization with Factual Consistency Evaluation".
Roee Aharoni et al. arXiv 2022. [Paper]
"Falsesum: Generating Document-level NLI Examples for Recognizing Factual Inconsistency in Summarization".
"Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback".
Paul Roit, Johan Ferret, and Lior Shani et al. ACL 2023. [Paper]
FEQA: "FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization".
QuestEval: "QuestEval: Summarization Asks for Fact-based Evaluation".
QAFactEval: "QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization".
FaithDial: "FaithDial: A Faithful Benchmark for Information-Seeking Dialogue".
"How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions".
Lorenzo Pacchiardi and Alex J. Chan et al. arXiv 2023. [Paper] [GitHub]
"Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning".
"HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models".
Fuxiao Liu and Tianrui Guan et al. arXiv 2023. [Paper] [GitHub]
"Analyzing and Evaluating Faithfulness in Dialogue Summarization".
"TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models".
"Safety Assessment of Chinese Large Language Models".
"FLASK: Fine-grained Language Model Evaluation Based on Alignment Skill Sets".
"Judging LLM-as-a-judge with MT-Bench and Chatbot Arena".
"Helpful, Honest, & Harmless - a Pragmatic Alignment Evaluation".
Amanda Askell et al. GitHub 2022. [GitHub]
"A Critical Evaluation of Evaluations for Long-form Question Answering".
"AlpacaEval: An Automatic Evaluator of Instruction-following Models".
Xuechen Li et al. GitHub 2023. [GitHub]
"PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization".
"Large Language Models are not Fair Evaluators".
"G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment".
"Benchmarking Foundation Models with Language-Model-as-an-Examiner".
"PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations".
"SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions".
PromptBench: "PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts".
"On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective".
RobuT: "RobuT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations".
SynTextBench: "On Robustness-Accuracy Characterization of Large Language Models using Synthetic Datasets".
Ching-Yun Ko et al. ICML 2023. [Paper]
ReCode: "ReCode: Robustness Evaluation of Code Generation Models".
"Exploring the Robustness of Large Language Models for Solving Programming Problems".
"A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models".
DGSlow: "White-Box Multi-Objective Adversarial Attack on Dialogue Generation".
"Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study".
Yi Liu et al. arXiv 2023. [Paper]
MasterKey: "MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots".
Gelei Deng et al. arXiv 2023. [Paper]
JailBroken: "Jailbroken: How Does LLM Safety Training Fail?".
Alexander Wei et al. NeurIPS 2023. [Paper]
"Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity".
Terry Yue Zhuo et al. arXiv 2023. [Paper]
"On Robustness of Prompt-based Semantic Parsing with Large Pre-trained Language Model: An Empirical Study on Codex".
Terry Yue Zhuo et al. EACL 2023. [Paper]
"How Important are Good Method Names in Neural Code Generation? A Model Robustness Perspective".
Guang Yang et al. TOSEM 2023. [Paper]
"Ask Again, Then Fail: Large Language Models' Vacillations in Judgement".
Qiming Xie and Zengzhi Wang et al. arXiv 2023. [Paper] [GitHub]
"Frontier AI Regulation: Managing Emerging Risks to Public Safety".
Markus Anderljung et al. arXiv 2023. [Paper]
"Model evaluation for extreme risks".
Toby Shevlane et al. arXiv 2023. [Paper]
"Is Power-Seeking AI an Existential Risk?".
Joseph Carlsmith. arXiv 2023. [Paper]
"Discovering Language Model Behaviors with Model-Written Evaluations".
Ethan Perez et al. ACL (Findings) 2023. [Paper]
"Evaluating Superhuman Models with Consistency Checks".
Lukas Fluri et al. arXiv 2023. [Paper]
"Understanding Social Reasoning in Language Models with Language Models".
Kanishk Gandhi et al. arXiv 2023. [Paper]
"Towards the Scalable Evaluation of Cooperativeness in Language Models".
Alan Chan et al. arXiv 2023. [Paper]
"Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations".
Yanda Chen et al. arXiv 2023. [Paper]
"AgentBench: Evaluating LLMs as Agents".
Xiao Liu et al. arXiv 2023. [Paper]
"WebArena: A Realistic Web Environment for Building Autonomous Agents".
Shuyan Zhou et al. arXiv 2023. [Paper]
"Training Socially Aligned Language Models in Simulated Human Society".
Ruibo Liu et al. arXiv 2023. [Paper]
"AgentSims: An Open-Source Sandbox for Large Language Model Evaluation".
Jiaju Lin et al. EMNLP 2023 demo track. [Paper]
"Evaluating Language-Model Agents on Realistic Autonomous Tasks".
Megan Kinniment et al. ARC Evals. [Paper]
MINT: "MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback".
"Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models".
"InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback".
John Yang et al. NeurIPS 2023 Datasets & Benchmarks track. [Paper] [GitHub] [Source]
MultiMedQA: "Large Language Models Encode Clinical Knowledge".
Karan Singhal, Shekoofeh Azizi and Tao Tu et al. arXiv 2022. [Paper]
PubMedQA: "PubMedQA: A Dataset for Biomedical Research Question Answering".
LiveQA: "Overview of the Medical Question Answering Task at TREC 2017 LiveQA".
CLUE: "Clinical language understanding evaluation (CLUE)".
Travis R. Goodwin et al. arXiv 2022. [Paper]
"Towards Expert-Level Medical Question Answering with Large Language Models".
Karan Singhal, Tao Tu, Juraj Gottweis and Rory Sayres et al. arXiv 2023. [Paper]
"Performance of ChatGPT on USMLE: Unlocking the Potential of Large Language Models for AI-Assisted Medical Education".
Prabin Sharma et al. arXiv 2023. [Paper]
"Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum".
John W. Ayers et al. JAMA Internal Medicine 2023. [Paper]
"Evaluating large language models on medical evidence summarization".
Liyan Tang et al. npj Digital Medicine 2023. [Paper]
"Can large language models reason about medical questions?".
"Capabilities of GPT-4 on Medical Challenge Problems".
Harsha Nori et al. arXiv 2023. [Paper]
"Evaluating the performance of ChatGPT in ophthalmology: An analysis of its successes and shortcomings".
Fares Antaki et al. Ophthalmology Science 2023. [Paper]
"ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models".
Namkee Oh et al. Annals of Surgical Treatment and Research 2023. [Paper]
"The AI teacher test: Measuring the pedagogical ability of Blender and GPT-3 in educational dialogues".
"Is ChatGPT a Good Teacher Coach? Measuring Zero-Shot Performance For Scoring and Providing Actionable Insights on Classroom Instruction".
"Learning gain differences between ChatGPT and human tutor generated algebra hints".
"Can Large Language Models Provide Feedback to Students? A Case Study on ChatGPT".
Wei Dai et al. ICALT 2023. [Paper]
"GPT-4 Passes the Bar Exam".
L’ART: "How well do SOTA legal reasoning models support abductive reasoning?".
Ha-Thanh Nguyen et al. ICLP 2023. [Paper]
"GPT Takes the Bar Exam".
"ChatGPT Goes to Law School".
Jonathan H. Choi et al. SSRN 2023. [Paper]
"Explaining Legal Concepts with Augmented Large Language Models (GPT-4)".
Jaromir Savelka et al. arXiv 2023. [Paper]
"How Ready are Pre-trained Abstractive Models and LLMs for Legal Case Judgement Summarization?".
Aniket Deroy et al. LegalAIIA 2023. [Paper]
"Legal Prompting: Teaching a Language Model to Think Like a Lawyer".
Fangyi Yu et al. arXiv 2022. [Paper]
"Can GPT-3 Perform Statutory Reasoning?".
LawBench: "LawBench: Benchmarking Legal Knowledge of Large Language Models".
Zhiwei Fei, Xiaoyu Shen and Dawei Zhu et al. arXiv 2023. [Paper] [GitHub]
"A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction".
"A Systematic Evaluation of Large Language Models of Code".
"Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation".
"Lost at C: A user study on the security implications of large language model code assistants".
Sandoval G et al. arXiv 2023. [Paper]
"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?".
Carlos E. Jimenez et al. arXiv 2023. [Paper] [GitHub] [Source]
"InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback".
John Yang et al. NeurIPS 2023 Datasets & Benchmarks track. [Paper] [GitHub] [Source]
"DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation".
"Xuanyuan 2.0: A large Chinese financial chat model with hundreds of billions parameters".
Zhang X et al. CIKM 2023. [Paper]
"FinBERT: A large language model for extracting information from financial text".
Huang A H et al. Contemporary Accounting Research 2023. [Paper]
"ChatGPT: Unlocking the future of NLP in finance".
Zaremba A et al. SSRN 2023. [Paper]
"GPT as a Financial Advisor".
Niszczota P et al. SSRN 2023. [Paper]
GLUE: "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding".
SuperGLUE: "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems".
LongBench: "LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding".
MMLU: "Measuring Massive Multitask Language Understanding".
MMCU: "Measuring Massive Multitask Chinese Understanding".
C-Eval: "C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models".
M3KE: "M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models".
CMMLU: "CMMLU: Measuring massive multitask language understanding in Chinese".
AGIEval: "AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models".
M3Exam: "M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models".
LucyEval: "Evaluating the Generation Capabilities of Large Chinese Language Models".
BIG-bench: "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models".
Evaluation Harness: "A framework for few-shot language model evaluation".
Leo Gao et al. arXiv 2023. [GitHub]
HELM: "Holistic Evaluation of Language Models".
OpenAI Evals [GitHub]
GPT-Fathom: "GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond".
Shen Zheng and Yuyu Zhang et al. arXiv 2023. [Paper] [GitHub]
"INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models".
Huggingface Open LLM Leaderboard [Source]
Chatbot Arena: "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena".
OpenCompass: "OpenCompass: A Universal Evaluation Platform for Foundation Models".
CLEVA: "CLEVA: Chinese Language Models EVAluation Platform".
OpenEval [Source]
Platform | Access | Domain |
---|---|---|
Chatbot Arena | [Source] | Evaluation Organization/ Benchmark for Holistic Evaluation |
CLEVA | [Source] | Evaluation Organization/ Benchmark for Holistic Evaluation |
FlagEval | [Source] | Evaluation Organization/ Benchmark for Holistic Evaluation |
HELM | [Source] | Evaluation Organization/ Benchmark for Holistic Evaluation |
Huggingface Open LLM Leaderboard | [Source] | Evaluation Organization/ Benchmark for Holistic Evaluation |
InstructEval | [Source] | Evaluation Organization/ Benchmark for Holistic Evaluation |
LLMonitor | [Source] | Evaluation Organization/ Benchmark for Holistic Evaluation |
OpenCompass | [Source] | Evaluation Organization/ Benchmark for Holistic Evaluation |
Open Ko-LLM | [Source] | Evaluation Organization/ Benchmark for Holistic Evaluation |
SuperCLUE | [Source] | Evaluation Organization/ Benchmark for Holistic Evaluation |
TheoremOne LLM Benchmarking Metrics | [Source] | Evaluation Organization/ Benchmark for Holistic Evaluation |
Toloka | [Source] | Evaluation Organization/ Benchmark for Holistic Evaluation |
Open Multilingual LLM Eval | [Source] | Evaluation Organization/ Benchmark for Holistic Evaluation |
OpenEval | [Source] | Evaluation Organization/ Benchmark for Holistic Evaluation |
ANGO | [Source] | Evaluation Organization/ Benchmarks for Knowledge and Reasoning |
C-Eval | [Source] | Evaluation Organization/ Benchmarks for Knowledge and Reasoning |
LucyEval | [Source] | Evaluation Organization/ Benchmarks for Knowledge and Reasoning |
MMLU | [Source] | Evaluation Organization/ Benchmarks for Knowledge and Reasoning |
OpenKG LLM | [Source] | Evaluation Organization/ Benchmarks for Knowledge and Reasoning |
SEED-Bench | [Source] | Evaluation Organization/ Benchmarks for NLU and NLG |
SuperGLUE | [Source] | Evaluation Organization/ Benchmarks for NLU and NLG |
Toolbench | [Source] | Knowledge and Capability Evaluation/ Tool Learning |
Hallucination Leaderboard | [Source] | Alignment Evaluation/ Truthfulness |
AlpacaEval | [Source] | Alignment Evaluation/ General Alignment Evaluation |
AgentBench | [Source] | Safety Evaluation/ Evaluating LLMs as Agents |
InterCode | [Source] | Safety Evaluation/ Evaluating LLMs as Agents |
SafetyBench | [Source] | Safety Evaluation |
Nucleotide Transformer | [Source] | Specialized LLMs Evaluation/ Biology and Medicine |
LAiW | [Source] | Specialized LLMs Evaluation/ Legislation |
Big Code Models Leaderboard | [Source] | Specialized LLMs Evaluation/ Computer Science |
Huggingface LLM Perf Leaderboard | [Source] | the Performance of LLMs |