irthomasthomas / undecidability


ClipNotes for paper: GPTScore - Evaluate as you desire #823

Open ShellLM opened 2 months ago

ShellLM commented 2 months ago

GPTScore: Evaluate as You Desire

Introduction

Assessing the quality of generated text is an even more arduous task than the generation itself, and this issue has not received adequate attention. This paper proposes a novel evaluation framework, GPTScore, which uses the emergent abilities (e.g., zero-shot instruction) of generative pre-trained models to score generated texts. The paper explores 19 pre-trained models, ranging in size from 80M parameters (e.g., FLAN-T5-small) to 175B (e.g., GPT3).

Experimental results on four text generation tasks, 22 evaluation aspects, and 37 corresponding datasets demonstrate that this approach lets users evaluate whatever they desire about a text simply by writing natural language instructions. This property helps overcome several long-standing challenges in text evaluation: how to achieve customized, multi-faceted evaluation without the need for annotated samples.

While text generation technology is advancing rapidly, techniques for evaluating the quality of these texts lag far behind. This is especially evident in the following ways:

Limitations of existing text evaluation techniques:

(a) Existing studies evaluate text quality along a limited set of aspects (e.g., semantic equivalence, fluency), which are prohibitively expensive to customize, making it hard for users to evaluate the aspects they actually need.

(b) A handful of studies have examined multi-aspect evaluation but have not paid adequate attention to how each aspect is defined or to the latent relationships among aspects. Instead, the evaluation of an aspect is either empirically bound to specific metric variants or learned from supervised signals.

(c) Recently proposed evaluation methods usually require a complicated training procedure or costly manual annotation of samples, which makes them hard to use in industrial settings.

This paper demonstrates the ability of very large pre-trained language models (e.g., GPT-3) to achieve multi-aspect, customized, and training-free evaluation. In essence, the approach uses the pre-trained model's zero-shot instruction following and in-context learning abilities to handle complex and ever-changing evaluation needs.

Emergent abilities of large language models

Recent work progressively reveals a variety of emergent abilities of generative pre-trained language models under appropriate tuning or prompting methods, such as:

- In-context learning: The model performs a task better when the prompt is prefixed with a few annotated samples (i.e., demonstrations).
Abstract: Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? Large language models (LMs) are able to in-context learn -- perform a new task via inference alone by conditioning on a few input-label pairs (demonstrations) and making predictions for new inputs. However, there has been little understanding of how the model learns and which aspects of the demonstrations contribute to end task performance. In this paper, we show that ground truth demonstrations are in fact not required -- randomly replacing labels in the demonstrations barely hurts performance on a range of classification and multiple-choice tasks, consistently over 12 different models including GPT-3. Instead, we find that other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of (1) the label space, (2) the distribution of the input text, and (3) the overall format of the sequence.
- Chain-of-thought reasoning: The model can perform complex reasoning by generating a series of intermediate reasoning steps.
Abstract: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking.
- Zero-shot instruction: The model can follow a natural-language description of a task with few or even zero annotated examples.

One core commonality of these abilities is that they allow customized requirements to be handled with few or even zero annotated examples. It is the emergence of these abilities that makes it possible to reinvent text evaluation as evaluation from a textual description, which can be customizable, multi-faceted, and training-free.

As illustrated in Fig. 2, to capture users' true desires, an evaluation protocol is first established based on (a) the task specification, which typically outlines how the text is generated (e.g., generate a response for a human based on the conversation), and (b) the aspect definition, which documents the details of the desired evaluation aspects (e.g., the response should be intuitive to understand).

Subsequently, each evaluation sample is presented together with the evaluation protocol and, optionally, a few exemplar samples that can facilitate the model's learning. Lastly, a large generative pre-trained model calculates how likely the text would be generated under this evaluation protocol, which gives rise to the model's name: GPTScore.

GPTScore Framework

Approach

GPTScore

The core idea of GPTScore is that a generative pre-trained model assigns a higher probability to high-quality generated text that follows a given instruction and context. Specifically, suppose the text to be evaluated is h = {h_1, h_2, ..., h_m}, the context information is S (e.g., source text or reference text), d is the task description, and a is the aspect definition; then GPTScore is defined as the following weighted sum of conditional log-probabilities:

GPTScore(h | d, a, S) = Σ_{t=1}^{m} w_t · log p(h_t | h_{<t}, T(d, a, S), θ),

where:

- w_t is the weight of the token at position t,
- m is the number of tokens in h, and h_{<t} denotes the tokens preceding position t,
- d is the task description and a is the aspect definition,
- S is the context information (e.g., source text or reference text),
- T(·) is the prompt template that organizes d, a, and S, and θ denotes the model parameters.

The few-shot setting with demonstration samples can be supported by extending the prompt template T with the demonstrations.
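As a concrete illustration, here is a minimal sketch of this scoring computation using a small HuggingFace causal LM (GPT-2) with uniform token weights w_t; the model choice and the gpt_score helper are assumptions made for illustration, not the authors' released implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def gpt_score(prompt: str, hypothesis: str) -> float:
    """Average log p(h_t | h_<t, prompt) over hypothesis tokens (uniform w_t)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    hypo_ids = tokenizer(" " + hypothesis, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, hypo_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits            # (1, seq_len, vocab)
    # Log-probability of each token given all preceding tokens.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_logp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that correspond to hypothesis tokens.
    hypo_logp = token_logp[:, prompt_ids.size(1) - 1:]
    return hypo_logp.mean().item()
```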

The prompt templates define how the task description, aspect definition, and context are organized. Designing desirable prompts is a non-trivial task in itself; the prompt design in this work draws on instruction-based resources such as Super-NaturalInstructions (abstract below):

Abstract: Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks How well can NLP models generalize to a variety of unseen tasks when provided with task instructions? To address this question, we first introduce Super-NaturalInstructions, a benchmark of 1,616 diverse NLP tasks and their expert-written instructions. Our collection covers 76 distinct task types, including but not limited to classification, extraction, infilling, sequence tagging, text rewriting, and text composition. This large and diverse collection of tasks enables rigorous benchmarking of cross-task generalization under instructions -- training models to follow instructions on a subset of tasks and evaluating them on the remaining unseen ones.
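As a purely illustrative example of how a template T(d, a, S) might organize these pieces (the actual templates in the paper are task- and aspect-specific and differ in wording), a template with optional demonstrations could look like this:

```python
def build_prompt(task_desc, aspect_def, context, demonstrations=None):
    """Assemble T(d, a, S): task description d, aspect definition a, context S,
    optionally prefixed with (context, output) demonstration pairs."""
    parts = [task_desc, aspect_def]
    for demo_context, demo_output in (demonstrations or []):
        parts.append(f"Context: {demo_context}\nOutput: {demo_output}")
    parts.append(f"Context: {context}\nOutput:")
    return "\n\n".join(parts)

prompt = build_prompt(
    task_desc="Generate a fluent summary for the following text.",
    aspect_def="A fluent summary is well-written and grammatical.",
    context="The city council approved the new transit budget on Tuesday.",
)
# score = gpt_score(prompt, hypothesis)   # see the scoring sketch above
```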

Scoring Dimension

GPTScore exhibits different variants depending on which texts the probability is calculated over. For example, given a generated hypothesis, GPTScore can be calculated by conditioning the hypothesis on the source text or on a reference text.

In this paper, the criteria for choosing among GPTScore variants are mainly designed to align with the protocol of the human judgments used to evaluate the reliability of automated metrics.
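For example, reusing the hypothetical helpers above, conditioning on the source text versus a human reference yields two different scoring variants:

```python
# Two illustrative variants; which one is used should match the protocol of
# the human judgments being correlated against (helper names are hypothetical).
def score_given_source(hypo, source, task_desc, aspect_def):
    return gpt_score(build_prompt(task_desc, aspect_def, source), hypo)

def score_given_reference(hypo, reference, task_desc, aspect_def):
    return gpt_score(build_prompt(task_desc, aspect_def, reference), hypo)
```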

Experiments

The experiments cover a broad range of natural language generation tasks: text summarization, machine translation, data-to-text generation, and dialogue response generation.

This involves 37 datasets and 22 evaluation aspects in total. The definitions of the evaluated aspects are shown below:

Table 1. The definition of aspects evaluated

| Aspect | Task | Definition |
|--------|------|------------|
| Semantic Coverage (COV) | Summ | How many semantic content units from the reference text are covered by the generated text? |
| Factuality (FAC) | Summ | Does the generated text preserve the factual statements of the source text? |
| Consistency (CON) | Summ, Diag | Is the generated text consistent in the information it provides? |
| Informativeness (INF) | Summ, D2T, Diag | How well does the generated text capture the key ideas of its source text? |
| Coherence (COH) | Summ, Diag | How much does the generated text make sense? |
| Relevance (REL) | Diag, Summ, D2T | How well is the generated text relevant to its source text? |
| Fluency (FLU) | Diag, Summ, D2T, MT | Is the generated text well-written and grammatical? |
| Accuracy (ACC) | MT | Are there inaccuracies, missing, or unfactual content in the generated text? |
| Multidimensional Quality Metrics (MQM) | MT | How is the overall quality of the generated text? |
| Interest (INT) | Diag | Is the generated text interesting? |
| Engagement (ENG) | Diag | Is the generated text engaging? |
| Specific (SPE) | Diag | Is the generated text generic or specific to the source text? |
| Correctness (COR) | Diag | Is the generated text correct or was there a misunderstanding of the source text? |
| Semantically appropriate (SEM) | Diag | Is the generated text semantically appropriate? |
| Understandability (UND) | Diag | Is the generated text understandable? |
| Error Recovery (ERR) | Diag | Is the system able to recover from errors that it makes? |
| Diversity (DIV) | Diag | Is there diversity in the system responses? |
| Depth (DEP) | Diag | Does the system discuss topics in depth? |
| Likeability (LIK) | Diag | Does the system display a likeable personality? |
| Flexibility (FLE) | Diag | Is the system flexible and adaptable to the user and their interests? |
| Inquisitiveness (INQ) | Diag | Is the system inquisitive throughout the conversation? |

The evaluated models include 19 pre-trained language models with different structures, spanning encoder-decoder models (e.g., FLAN-T5) and decoder-only models (e.g., GPT2, OPT, and GPT3 variants such as GPT3-a01, GPT3-c01, GPT3-d01, and GPT3-d03), ranging from 80M to 175B parameters.

Three scenarios are explored:

  1. Vanilla (VAL): with non-instruction and non-demonstration
  2. Instruction (IST): with instruction and non-demonstration
  3. Instruction+demonstration (IDM): with instruction and demonstration
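
Using the same hypothetical helpers as above, the three scenarios differ only in how the prompt is built:

```python
source = "The city council approved the new transit budget on Tuesday."
hypo = "The council passed the transit budget."
task_desc = "Generate a fluent summary for the following text."
aspect_def = "A fluent summary is well-written and grammatical."
demos = [("A storm hit the coast overnight.", "A coastal storm struck overnight.")]

# 1. Vanilla (VAL): no instruction, no demonstrations.
val = gpt_score(f"Context: {source}\nOutput:", hypo)

# 2. Instruction (IST): task description and aspect definition, no demonstrations.
ist = gpt_score(build_prompt(task_desc, aspect_def, source), hypo)

# 3. Instruction + demonstration (IDM): instruction plus a few exemplars.
idm = gpt_score(build_prompt(task_desc, aspect_def, source, demos), hypo)
```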

Main Results

Text Summarization

- The introduction of instructions significantly improves the performance of GPTScore, and the benefit from instruction is more stable for the decoder-only models (GPT2, OPT).
- Among the GPT3-based models, GPT3-d01 is only marginally better than GPT3-c01, which aims to balance power and speed, while GPT3-d03 performs significantly better than GPT3-d01.

Machine Translation

- Introducing instruction (IST) significantly improves performance on all three aspects: accuracy, fluency, and MQM.
- The combination of instruction and demonstration (IDM) brings gains for evaluators with different model structures.
- The evaluator built on GPT3-c01 achieves performance comparable to GPT3-d01 and GPT3-d03.

Data-to-Text

- Introducing instruction (IST) significantly improves performance, and adding demonstrations (IDM) improves it further.
- The decoder-only models are better at utilizing demonstrations to achieve high performance.
- GPT3 copes well with unformatted text: named entities in the BAGEL dataset are replaced with special tokens (e.g., X and Y), as in "X is a cafe restaurant", where "X" denotes the name of the cafe.

Dialogue Response Generation

- The performance of GPT3-d01 is much better than that of GPT3-d03, even though both have the same model size.
- The average Spearman correlation of GPT3-d01 outperforms GPT3-d03 by 40.8 on the FED turn-level dataset and by 5.5 on the FED dialogue-level dataset.

Ablation Studies

Effectiveness of Demonstration

- The utilization of demonstrations significantly improves evaluation performance across aspects.
- There is an upper bound on the performance gains from demonstrations: for example, when the number of demonstration samples K > 4, performance on the accuracy aspect is hard to improve further.
- When the demonstration contains only a few samples (e.g., K = 1), small models (e.g., GPT3-a01) are prone to performance degradation.

Partial Order of Evaluation Aspects

To explore the correlation between aspects, an empirical analysis is conducted with the "interesting" (INT) aspect on the dialogue task: the definition of INT is combined with the definitions of other aspects and the resulting performance is measured. Expanding the definition of INT from "Is this response interesting to the conversation?" to "Is this an interesting response that is specific and engaging?" boosts the Spearman correlation from 30.8 to 48.6. The best performance of 51.4 is achieved after combining five aspects (INT, ENG, SPE, COR, REL), which already exceeds the 50.1 achieved by the most potent scoring model, GPT3-d01, with an aspect definition built only on INT.
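A minimal sketch of plugging such a combined aspect definition back into the scoring pipeline (the combined wording below is a paraphrase, and the helpers are the hypothetical ones from earlier):

```python
combined_int = ("Is this an interesting response that is specific, engaging, "
                "correct, and relevant to the conversation?")

dialogue_history = "User: Any plans for the weekend?"
response = "I'm thinking about hiking; there's a great trail near the lake."

prompt = build_prompt(
    task_desc="Generate a response for a human based on the conversation.",
    aspect_def=combined_int,
    context=dialogue_history,
)
print(gpt_score(prompt, response))
```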

Conclusion

This paper proposes to leverage the emergent abilities of generative pre-trained models to address intricate and ever-changing evaluation requirements. The proposed framework, GPTScore, is studied on multiple pre-trained language models with different structures, including GPT3 with a model size of 175B.

GPTScore has multiple benefits: it is customizable, multi-faceted, and training-free.

This enables flexible crafting of a metric that supports 22 evaluation aspects on 37 datasets without any learning process while attaining competitive performance. This work opens a new way to audit generative AI by utilizing generative AI.

URL: https://github.com/jinlanfu/GPTScore

Suggested labels

{'label-name': 'NLP-Evaluation', 'label-description': 'Focuses on evaluating natural language processing models and text generation quality.', 'confidence': 55.41}

ShellLM commented 2 months ago

Related content

- #811 similarity score: 0.91
- #706 similarity score: 0.89
- #333 similarity score: 0.88
- #769 similarity score: 0.88
- #817 similarity score: 0.88
- #715 similarity score: 0.87