
Train and Evaluate model #10

Open · JaNoNi opened this issue 2 years ago

JaNoNi commented 2 years ago

Metrics for NLG evaluation:

  1. Human-Centric Evaluation. "Even though human evaluation methods are useful in these scenarios for evaluating aspects like coherency, naturalness, or fluency, aspects like diversity or creativity may be difficult for human judges to assess as they have no knowledge about the dataset that the model is trained on. Language models can learn to copy from the training dataset and generate samples that a human judge will rate as high in quality, but may fail in generating diverse samples (i.e., samples that are very different from training samples), as has been observed in social chatbots. A language model optimized only for perplexity may generate coherent but bland responses."

  2. Untrained Automatic Metrics.

    2.1. n-gram Overlap Metrics for Content Selection (see the BLEU/ROUGE sketch below this list)

    BLEU (Bilingual Evaluation Understudy)

    • "bleu achieves strongest correlation with human assessment, but does not significantly outperform the best-performing rouge variant"

    ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

    • rouge-l measures the longest matching sequence of words using longest common sub-sequence (LCS)
    • rouge-s (less commonly used) measures skip-bigram-based co-occurrence statistics
    • rouge-su (less commonly used) measures skip-bigram and unigram-based co-occurrence statistics.
    • "Compared to bleu, rouge focuses on recall rather than precision and is more interpretable than bleu."
    • "rouge’s reliance on n-gram matching can be an issue, especially for long-text generation tasks"

    CIDEr (Consensus-based Image Description Evaluation)

    • "an automatic metric for measuring the similarity of a generated sentence against a set of human-written sentences using a consensus-based protocol." 2.2. Distance-Based Evaluation Metrics for Content Selection
  3. Machine-Learned Metrics. "[We can] build machine-learned models (trained on human judgment data) to mimic human judges to measure many quality metrics of output, such as factual correctness, naturalness, fluency, coherence, etc."

    • SENTBERT (fine-tuned BERT to optimize the BERT parameters)
    • BERTSCORE (see the BERTScore sketch at the end of this comment)
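
A minimal sketch of how the n-gram overlap metrics above could be computed in practice, assuming the Hugging Face `evaluate` package (with its `rouge_score` dependency) is installed; the sentences below are placeholders, not data from this project:

```python
# Minimal sketch (not this project's actual pipeline): scoring one generated
# sentence against one reference with the n-gram overlap metrics from point 2.
import evaluate

# Placeholder candidate/reference pair, purely for illustration.
prediction = "the model generates a short summary of the article"
reference = "a short summary of the article is generated by the model"

# BLEU: modified n-gram precision with a brevity penalty (precision-oriented).
bleu = evaluate.load("bleu")
print(bleu.compute(predictions=[prediction], references=[[reference]]))

# ROUGE: recall-oriented n-gram overlap; rougeL is based on the longest
# common subsequence (LCS) mentioned above.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=[prediction], references=[reference]))
```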

Source 1: Evaluation of Text Generation: A Survey
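
For the machine-learned metrics in point 3, a similar hedged sketch using BERTScore through the same `evaluate` wrapper (assuming the `bert_score` package is installed; the sentences are again placeholders):

```python
# Minimal sketch for point 3: BERTScore via the Hugging Face `evaluate` wrapper.
# A default English BERT-family model is downloaded on first use.
import evaluate

prediction = "the model generates a short summary of the article"
reference = "a short summary of the article is generated by the model"

bertscore = evaluate.load("bertscore")
# Contextual token embeddings of candidate and reference are matched greedily
# to produce per-pair precision, recall, and F1.
results = bertscore.compute(predictions=[prediction], references=[reference], lang="en")
print(results["precision"], results["recall"], results["f1"])
```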