Wenshansilvia opened this issue 9 months ago
This issue is to define a new feature for measuring answer correctness.
There are some open-source metric libraries that we may be able to use in our project, for example: Rouge (rouge-score), Rouge-chinese, and MAUVE.
I added Rouge metrics using rouge-score, referencing huggingface/evaluate and stanford-crfm/helm. For non-Latin languages such as Chinese, the user can optionally supply their own word cutter, similar to what rouge-chinese does. If necessary, these tokenizers can be added later.
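As a sketch of how a user-supplied word cutter could plug in, here is a minimal ROUGE-1 with a pluggable tokenizer. This is illustrative only; the actual implementation should build on rouge-score, and a Chinese segmenter such as jieba would be passed in by the user.

```python
from collections import Counter

def rouge_1(prediction: str, reference: str, tokenize=None) -> dict:
    """Minimal ROUGE-1: clipped unigram precision/recall/F1.

    `tokenize` defaults to whitespace splitting; for non-Latin languages
    such as Chinese, pass a word cutter (e.g. jieba.lcut) instead.
    """
    tokenize = tokenize or str.split
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    overlap = sum((pred_counts & ref_counts).values())  # clipped matches
    p = overlap / max(sum(pred_counts.values()), 1)
    r = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f1}
```

For Chinese text, `rouge_1(pred, ref, tokenize=jieba.lcut)` would compute the score over segmented words instead of whitespace tokens.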
There are many high-quality metric implementations in `datasets` that we can use directly, and we can learn from them when implementing the metrics they do not cover.
Here are some metrics related to the answer, and the papers that mention them.
- Answer/Query: DisambigF1 (Active Retrieval Augmented Generation); Answer Relevance (RAGAS: Automated Evaluation of Retrieval Augmented Generation)
- Answer/Contexts: FActScore (FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation); D-FActScore (Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations)
- Answer/GT_Answer: accuracy, EM, F1, Rouge; BLEU, TER, chrF++ (Lift Yourself Up: Retrieval-Augmented Text Generation with Self Memory); Q-BLEU-1 (Towards a Better Metric for Evaluating Question Generation Systems); citation recall/precision (Enabling Large Language Models to Generate Text with Citations); nF1 (Hindsight: Posterior-Guided Training of Retrievers for Improved Open-Ended Generation); Rare F1 (Retrieval Augmentation Reduces Hallucination in Conversation); Disambiguation-Rouge (PreWoMe: Exploiting Presuppositions as Working Memory for Long Form Question Answering); BERTScore (LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models); exact-match accuracy, assertion-method-matched accuracy, plausible-match accuracy, LCS, edit distance (Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning); perplexity (Improving Retrieval-Augmented LMs with Compression and Selective Augmentation); bits-per-byte (REPLUG: Retrieval-Augmented Black-Box Language Models); MAUVE (MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers; Enabling Large Language Models to Generate Text with Citations); truthful and informative (TruthfulQA: Measuring How Models Mimic Human Falsehoods)
disambig-F1 (Active Retrieval Augmented Generation; ASQA: Factoid Questions Meet Long-Form Answers): use a RoBERTa-based model to normalize the answer and the ground-truth answer, then compute the token-level F1 score between them.
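The token-level F1 part can be sketched as below. The RoBERTa-based normalization from the ASQA paper is replaced here by simple lowercasing, purely as a stand-in.

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """SQuAD-style token-level F1 between two strings.

    ASQA's disambig-F1 normalizes both strings with a RoBERTa-based
    model first; lowercasing + whitespace split is a placeholder.
    """
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```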
I am going to implement the following 2 metrics: LCS and edit distance (both from Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning).
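Both metrics are standard dynamic programs; a minimal sketch, for reference:

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            # extend the LCS on a match, else carry the best so far
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions/deletions/substitutions."""
    dp = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            # prev holds dp[i-1][j-1]; dp[j] still holds dp[i-1][j]
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]
```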
I am going to implement the following 2 metrics: BLEU (BLEU: a Method for Automatic Evaluation of Machine Translation) and Q-BLEU-1 (Towards a Better Metric for Evaluating Question Generation Systems).
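For reference, sentence-level BLEU-1 reduces to clipped unigram precision times a brevity penalty. This is only a sketch: full BLEU averages n-gram orders 1 through 4 and is usually computed at corpus level (e.g. via sacrebleu).

```python
import math
from collections import Counter

def bleu_1(prediction: str, reference: str) -> float:
    """Sentence-level BLEU-1: clipped unigram precision x brevity penalty."""
    pred = prediction.split()
    ref = reference.split()
    if not pred:
        return 0.0
    # clip each unigram's count by its count in the reference
    clipped = sum((Counter(pred) & Counter(ref)).values())
    precision = clipped / len(pred)
    # penalize candidates shorter than the reference
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * precision
```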
I am going to implement F1, TER, and chrF++ (Lift Yourself Up: Retrieval-Augmented Text Generation with Self Memory).
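A minimal sketch of plain chrF (averaged character n-gram F-beta, Popovic 2015). chrF++ additionally mixes in word 1- and 2-grams, and sacrebleu provides a production implementation; this is only to illustrate the computation. Stripping whitespace before building n-grams is an assumption here.

```python
from collections import Counter

def chrf(prediction: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """chrF: character n-gram F-beta, averaged over orders 1..max_n."""
    pred = prediction.replace(" ", "")
    ref = reference.replace(" ", "")
    scores = []
    for n in range(1, max_n + 1):
        pred_ngrams = Counter(pred[i:i + n] for i in range(len(pred) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not pred_ngrams or not ref_ngrams:
            continue  # strings shorter than n contribute nothing
        overlap = sum((pred_ngrams & ref_ngrams).values())
        p = overlap / sum(pred_ngrams.values())
        r = overlap / sum(ref_ngrams.values())
        if p + r == 0:
            scores.append(0.0)
        else:
            # beta=2 weights recall twice as much as precision
            scores.append((1 + beta ** 2) * p * r / (beta ** 2 * p + r))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```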
I'm going to implement truthful and informative (TruthfulQA: Measuring How Models Mimic Human Falsehoods).
The Q-BLEU metric measures the answerability of questions generated by an automatic question generation system: it checks whether the question includes relevant content words, named entities, question types, and function words. Since this metric is not useful for answer generation, I am going to implement another metric instead: perplexity (Improving Retrieval-Augmented LMs with Compression and Selective Augmentation).
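Given per-token log-probabilities from the evaluated language model (a hypothetical input here; in practice they would come from the model's scoring API), perplexity is the exponentiated negative mean log-likelihood:

```python
import math

def perplexity(token_logprobs: list) -> float:
    """Perplexity from per-token natural-log probabilities:
    exp(-mean log-likelihood). Lower is better."""
    assert token_logprobs, "need at least one token"
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

For example, a model that assigns every token probability 0.25 has perplexity 4: it is as uncertain as a uniform choice among four tokens.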
@QianHaosheng @bugtig6351 @yuanpcr you can list all potential metrics for the `generate` task in this issue. For more details about the `generate` task, you can refer to issue #12.