Wait, now that I read further, https://aclanthology.org/W04-1013.pdf (the original ROUGE paper) says:
"In a separate study (Lin and Och, 2004), ROUGE-L, W, and S were also shown to be very effective in automatic evaluation of machine translation. The stability and reliability of ROUGE at different sample sizes was reported by the author in (Lin, 2004)."
Hmm, so maybe the lesson is just "don't use plain ROUGE, use ROUGE-L".
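To make that distinction concrete to myself: plain ROUGE-N counts n-gram overlap, while ROUGE-L scores the longest common subsequence. A minimal sketch using Google's `rouge-score` package (my choice of implementation, and invented sentences, purely for illustration):

```python
# Contrast plain ROUGE-N (n-gram overlap) with ROUGE-L (longest common
# subsequence) on a toy pair. Requires: pip install rouge-score
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
hypothesis = "the cat is on the mat"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
for name, score in scorer.score(reference, hypothesis).items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F={score.fmeasure:.3f}")
```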
MT Eval: A qualitative approach https://diposit.ub.edu/dspace/bitstream/2445/65906/1/ECP_PhD_THESIS.pdf
“Non-standard metrics ROUGE is a metric common in automatic summarization but not in MT, and was never correlated with human judgement in a large study. In eight out of 14 papers, BLEU is used with a non-standard maximum ngram order, producing variants such as BLEU-1, BLEU-2, etc. Similar to ROUGE, these variants of BLEU have never been validated as metrics of translation quality, and their use is scientifically unmotivated.”
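To pin down what those non-standard variants are: BLEU-1/BLEU-2 just cap the n-gram order at 1 or 2 instead of the standard 4. A toy sketch with NLTK (package choice and sentences are mine, for illustration only):

```python
# BLEU-n caps the n-gram order via the weights argument; standard BLEU
# uses uniform weights over 1- to 4-grams. Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "cat", "sat", "on", "the", "mat"]]
hypothesis = ["the", "cat", "on", "the", "mat"]

print("BLEU-1:", sentence_bleu(reference, hypothesis, weights=(1.0,)))
print("BLEU-2:", sentence_bleu(reference, hypothesis, weights=(0.5, 0.5)))
# Standard BLEU-4 is 0 here (no matching 4-gram), which is exactly the
# temptation to quietly lower the order on short or hard sentences.
print("BLEU-4:", sentence_bleu(reference, hypothesis))
```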
https://ieeexplore.ieee.org/document/5071230 correlates it with human evaluation for summarization; the verdict: "not great."
This one also looks at it for summarization: https://www.semanticscholar.org/paper/Looking-for-a-Few-Good-Metrics%3A-ROUGE-and-its-Lin/cfad182f567b6a6d54a4dd517e0122692afca9fb?sort=relevance&queryString=translation
I'm searching through papers that cite ROUGE for "translation evaluation" https://www.semanticscholar.org/paper/ROUGE%3A-A-Package-for-Automatic-Evaluation-of-Lin/60b05f32c32519a809f21642ef1eb3eaf3848008?sort=relevance&queryString=translation%20evaluation
@inproceedings{Wubben2010ParaphraseGA,
title={Paraphrase Generation as Monolingual Translation: Data and Evaluation},
author={Sander Wubben and Antal van den Bosch and Emiel Krahmer},
booktitle={International Conference on Natural Language Generation},
year={2010},
url={https://api.semanticscholar.org/CorpusID:11507867}
}
@inproceedings{Iskender2021DoesSE,
title={Does Summary Evaluation Survive Translation to Other Languages?},
author={Neslihan Iskender and Oleg V. Vasilyev and Tim Polzehl and John Bohannon and Sebastian Moller},
booktitle={North American Chapter of the Association for Computational Linguistics},
year={2021},
url={https://api.semanticscholar.org/CorpusID:237532546}
}
@article{Mohtashami2023LearningTQ,
title={Learning Translation Quality Evaluation on Low Resource Languages from Large Language Models},
author={Amirkeivan Mohtashami and Mauro Verzetti and Paul K. Rubenstein},
journal={ArXiv},
year={2023},
volume={abs/2302.03491},
url={https://api.semanticscholar.org/CorpusID:256627762}
}
The last one has this to say, but doesn't cite anything for correlation.
Conspicuously absent from WMT:
@inproceedings{Blain2023FindingsOT,
title={Findings of the WMT 2023 Shared Task on Quality Estimation},
author={Frederic Blain and Chrysoula Zerva and Ricardo Ribeiro and Nuno M. Guerreiro and Diptesh Kanojia and Jos{\'e} G. C. de Souza and Beatriz Silva and T{\^a}nia Vaz and Jingxuan Yan and Fatemeh Azadi and Constantin Orasan and Andr{\'e} Martins},
booktitle={Conference on Machine Translation},
year={2023},
url={https://api.semanticscholar.org/CorpusID:265608057}
}
or 2019
@inproceedings{Ma2019ResultsOT,
title={Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges},
author={Qingsong Ma and Johnny Wei and Ondrej Bojar and Yvette Graham},
booktitle={Conference on Machine Translation},
year={2019},
url={https://api.semanticscholar.org/CorpusID:201742578}
}
@inproceedings{Sellam2020BLEURTLR,
title={BLEURT: Learning Robust Metrics for Text Generation},
author={Thibault Sellam and Dipanjan Das and Ankur P. Parikh},
booktitle={Annual Meeting of the Association for Computational Linguistics},
year={2020},
url={https://api.semanticscholar.org/CorpusID:215548699}
}
BLEURT says it isn't good, and cites studies!
... but they seem to be for "generation", not translation specifically.
Also: ah, there we go. Papers cited in BLEURT for ROUGE:
For dialogue:
@article{Liu2016HowNT,
title={How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation},
author={Chia-Wei Liu and Ryan Lowe and Iulian Serban and Michael Noseworthy and Laurent Charlin and Joelle Pineau},
journal={ArXiv},
year={2016},
volume={abs/1603.08023},
url={https://api.semanticscholar.org/CorpusID:9197196}
}
"natural language evaluation"
@article{Chaganty2018ThePO,
title={The price of debiasing automatic metrics in natural language evaluation},
author={Arun Tejasvi Chaganty and Stephen Mussmann and Percy Liang},
journal={ArXiv},
year={2018},
volume={abs/1807.02202},
url={https://api.semanticscholar.org/CorpusID:49568810}
}
It is actually mentioned in WMT 2018
@inproceedings{Ma2018ResultsOT,
title={Results of the WMT18 Metrics Shared Task: Both characters and embeddings achieve good performance},
author={Qingsong Ma and Ondrej Bojar and Yvette Graham},
booktitle={Conference on Machine Translation},
year={2018},
url={https://api.semanticscholar.org/CorpusID:53246643}
}
... in passing
@inproceedings{Novikova2017WhyWN,
title={Why We Need New Evaluation Metrics for NLG},
author={Jekaterina Novikova and Ondrej Dusek and Amanda Cercas Curry and Verena Rieser},
booktitle={Conference on Empirical Methods in Natural Language Processing},
year={2017},
url={https://api.semanticscholar.org/CorpusID:1929239}
}
COMET paper also leaves out ROUGE:
@article{Rei2020COMETAN,
title={COMET: A Neural Framework for MT Evaluation},
author={Ricardo Rei and Craig Alan Stewart and Ana C. Farinha and Alon Lavie},
journal={ArXiv},
year={2020},
volume={abs/2009.09025},
url={https://api.semanticscholar.org/CorpusID:221819581}
}
Here's a survey on MT Eval which does mention it:
@article{Mondal2023MachineTA,
title={Machine translation and its evaluation: a study},
author={Subrota Kumar Mondal and Haoxi Zhang and H M Dipu Kabir and Kan Ni and Hongning Dai},
journal={Artificial Intelligence Review},
year={2023},
volume={56},
pages={10137-10226},
url={https://api.semanticscholar.org/CorpusID:257177987}
}
Aha! OK, this one comes to another conclusion.
On the other hand, the only thing they cite is the original paper. So this is not a new correlation study; they're just citing ROUGE itself, and ROUGE itself claims to be good for MT.
No wait, I read it wrong; I should look at tables 26 and 27.
What I cannot seem to figure out is where those scores are coming from. It's quite unclear to me.
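(For my own sanity: the "correlation" these tables report is typically just Pearson's r between metric scores and human judgments, per system or per segment. A toy sketch with fabricated numbers, only to pin down the computation:)

```python
# Metric validation a la the WMT metrics tasks: correlate metric scores
# with human judgments. All numbers below are made up for illustration.
from scipy.stats import pearsonr

human_scores = [72.1, 65.4, 80.3, 58.9, 69.0]    # e.g. direct assessment, one per system
metric_scores = [0.41, 0.36, 0.52, 0.30, 0.44]   # e.g. ROUGE-L F, one per system

r, p_value = pearsonr(human_scores, metric_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```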
This one does provide some support for ROUGE-L as an MT metric:
@inproceedings{Lin2004ORANGEAM,
title={ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation},
author={Chin-Yew Lin and Franz Josef Och},
booktitle={International Conference on Computational Linguistics},
year={2004},
url={https://api.semanticscholar.org/CorpusID:7139779}
}
OK, a new tack: trace back through SLT (sign language translation) papers, see if I can find who decided to use ROUGE and why.
They also don't compare with previous work, I think?
Oh here's the shared task of course: https://www.semanticscholar.org/paper/Findings-of-the-Second-WMT-Shared-Task-on-Sign-M%C3%BCller-Alikhani/604cc00ff6a4d57d97856b49be4df89452cf30a8
Maybe I can filter for papers that use RWTH-PHOENIX-WEATHER? That dataset predates the 2018 "Neural SLT" paper.
Also, looking at their baselines, it's BLEU and TER for them.
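(Both of those baselines are a one-liner with sacrebleu, for reference; the sentence pair is invented:)

```python
# BLEU and TER, the two baseline metrics that paper reports, via sacrebleu.
# Requires: pip install sacrebleu
from sacrebleu.metrics import BLEU, TER

hypotheses = ["tomorrow it will be sunny in the north"]
references = [["in the north it will be sunny tomorrow"]]  # one reference stream

print(BLEU().corpus_score(hypotheses, references))
print(TER().corpus_score(hypotheses, references))
```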
Searching the ROUGE citations for "sign" in the range between 2004 and 2018 (44 citations in that range): https://www.semanticscholar.org/paper/ROUGE%3A-A-Package-for-Automatic-Evaluation-of-Lin/60b05f32c32519a809f21642ef1eb3eaf3848008?year%5B0%5D=2003&year%5B1%5D=2018&sort=relevance&page=2&queryString=sign
Scrolled through and found ONLY two sign language papers. The second one cites the first, again just saying that ROUGE is commonly used.
Within "evaluation metrics", talk about how ROUGE is not really intended for machine translation, and the pitfalls thereof.
https://stats.stackexchange.com/questions/301626/interpreting-rouge-scores
https://towardsdatascience.com/to-rouge-or-not-to-rouge-6a5f3552ea45
https://hyperskill.org/learn/step/29669
https://en.wikipedia.org/wiki/ROUGE_(metric)
https://aclanthology.org/P04-1077/ "Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics" is the one that introduces the ROUGE-L variant; this one is actually aimed at MT.
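And since that's the paper that actually defines the variant, here's sentence-level ROUGE-L from scratch, following its LCS-based F-measure (naive whitespace tokenization and beta=1.0 are my simplifications; a sketch for my own understanding, not the official scorer):

```python
# Sentence-level ROUGE-L per Lin & Och (2004): with LCS length L over
# reference X (length m) and hypothesis Y (length n),
#   R = L/m, P = L/n, F = (1 + beta^2) * R * P / (R + beta^2 * P).
def lcs_length(x, y):
    # Classic O(m*n) dynamic-programming longest-common-subsequence length.
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if xi == yj
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[-1][-1]

def rouge_l(reference, hypothesis, beta=1.0):
    # beta trades off recall vs. precision; 1.0 gives a balanced F-score.
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(hyp)
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)

print(rouge_l("the cat sat on the mat", "the cat is on the mat"))  # ~0.833
```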