Metrics for NLG evaluation:

Human-Centric Evaluation.
"Even though human evaluation methods are useful in these scenarios for evaluating aspects like coherency, naturalness, or fluency, aspects like diversity or creativity may be difficult for human judges to assess as they have no knowledge about the dataset that the model is trained on. Language models can learn to copy from the training dataset and generate samples that a human judge will rate as high in quality, but may fail in generating diverse samples (i.e., samples that are very different from training samples), as has been observed in social chatbots. A language model optimized only for perplexity may generate coherent but bland responses."
"an automatic metric for measuring the similarity of a generated sentence against a set of human-written sentences using a consensus-based protocol."
2.2. Distance-Based Evaluation Metrics for Content Selection
Machine-Learned Metrics.
"n build machine-learned models (trained on human judgment data) to mimic human judges to measure many quality metrics of output, such as factual correctness, naturalness, fluency, coherence, etc."
SENTBERT (fine-tuned BERT to optimize the BERT parameters)
BERTSCORE
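BERTSCORE, for instance, aligns candidate and reference tokens greedily by the cosine similarity of their contextual embeddings and reports precision, recall, and F1. The sketch below assumes the token embeddings are already available (random arrays stand in for BERT outputs) and omits the optional IDF weighting used by the real metric:

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """Greedy token matching in the style of BERTScore: build the cosine
    similarity matrix between candidate and reference token embeddings, then
    match each token to its most similar counterpart on the other side."""
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                      # shape (|candidate|, |reference|)
    precision = sim.max(axis=1).mean()      # best reference match per candidate token
    recall = sim.max(axis=0).mean()         # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)

# Random arrays stand in for contextual BERT token embeddings (seq_len x dim).
rng = np.random.default_rng(0)
print(bertscore_f1(rng.normal(size=(6, 768)), rng.normal(size=(7, 768))))
```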
Source 1: Evaluation of Text Generation: A Survey