Wait, now that I read further, https://aclanthology.org/W04-1013.pdf (the original ROUGE paper) says:
"In a separate study (Lin and Och, 2004), ROUGE-L, W, and S were also shown to be very effective in automatic evaluation of machine translation. The stability and reliability of ROUGE at different sample sizes was reported by the author in (Lin, 2004)."
Hmm, so maybe the lesson is just "don't use plain ROUGE, use ROUGE-L".
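To make that distinction concrete to myself: plain ROUGE-N counts n-gram overlap, while ROUGE-L scores the longest common subsequence. A minimal sketch using Google's `rouge-score` package (my choice of implementation, and invented sentences, purely for illustration):

```python
# Contrast plain ROUGE-N (n-gram overlap) with ROUGE-L (longest common
# subsequence) on a toy pair. Requires: pip install rouge-score
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
hypothesis = "the cat is on the mat"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
for name, score in scorer.score(reference, hypothesis).items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F={score.fmeasure:.3f}")
```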
MT Eval: A qualitative approach https://diposit.ub.edu/dspace/bitstream/2445/65906/1/ECP_PhD_THESIS.pdf
“Non-standard metrics ROUGE is a metric common in automatic summarization but not in MT, and was never correlated with human judgement in a large study. In eight out of 14 papers, BLEU is used with a non-standard maximum ngram order, producing variants such as BLEU-1, BLEU-2, etc. Similar to ROUGE, these variants of BLEU have never been validated as metrics of translation quality, and their use is scientifically unmotivated.”
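To pin down what those non-standard variants are: BLEU-1/BLEU-2 just cap the n-gram order at 1 or 2 instead of the standard 4. A toy sketch with NLTK (package choice and sentences are mine, for illustration only):

```python
# BLEU-n caps the n-gram order via the weights argument; standard BLEU
# uses uniform weights over 1- to 4-grams. Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "cat", "sat", "on", "the", "mat"]]
hypothesis = ["the", "cat", "on", "the", "mat"]

print("BLEU-1:", sentence_bleu(reference, hypothesis, weights=(1.0,)))
print("BLEU-2:", sentence_bleu(reference, hypothesis, weights=(0.5, 0.5)))
# Standard BLEU-4 is 0 here (no matching 4-gram), which is exactly the
# temptation to quietly lower the order on short or hard sentences.
print("BLEU-4:", sentence_bleu(reference, hypothesis))
```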
https://ieeexplore.ieee.org/document/5071230 correlates it with human evaluation for summarization; the verdict: "not great."
This one also looks at it for summarization: https://www.semanticscholar.org/paper/Looking-for-a-Few-Good-Metrics%3A-ROUGE-and-its-Lin/cfad182f567b6a6d54a4dd517e0122692afca9fb?sort=relevance&queryString=translation
I'm searching through papers that cite ROUGE for "translation evaluation" https://www.semanticscholar.org/paper/ROUGE%3A-A-Package-for-Automatic-Evaluation-of-Lin/60b05f32c32519a809f21642ef1eb3eaf3848008?sort=relevance&queryString=translation%20evaluation
@inproceedings{Wubben2010ParaphraseGA,
title={Paraphrase Generation as Monolingual Translation: Data and Evaluation},
author={Sander Wubben and Antal van den Bosch and Emiel Krahmer},
booktitle={International Conference on Natural Language Generation},
year={2010},
url={https://api.semanticscholar.org/CorpusID:11507867}
}
@inproceedings{Iskender2021DoesSE,
title={Does Summary Evaluation Survive Translation to Other Languages?},
author={Neslihan Iskender and Oleg V. Vasilyev and Tim Polzehl and John Bohannon and Sebastian Moller},
booktitle={North American Chapter of the Association for Computational Linguistics},
year={2021},
url={https://api.semanticscholar.org/CorpusID:237532546}
}
@article{Mohtashami2023LearningTQ,
title={Learning Translation Quality Evaluation on Low Resource Languages from Large Language Models},
author={Amirkeivan Mohtashami and Mauro Verzetti and Paul K. Rubenstein},
journal={ArXiv},
year={2023},
volume={abs/2302.03491},
url={https://api.semanticscholar.org/CorpusID:256627762}
}
The last one has this to say, but doesn't cite anything for correlation.
Conspicuously absent from WMT:
@inproceedings{Blain2023FindingsOT,
title={Findings of the WMT 2023 Shared Task on Quality Estimation},
author={Frederic Blain and Chrysoula Zerva and Ricardo Ribeiro and Nuno M. Guerreiro and Diptesh Kanojia and Jos{\'e} G. C. de Souza and Beatriz Silva and T{\^a}nia Vaz and Jingxuan Yan and Fatemeh Azadi and Constantin Orasan and Andr{\'e} Martins},
booktitle={Conference on Machine Translation},
year={2023},
url={https://api.semanticscholar.org/CorpusID:265608057}
}
or 2019
@inproceedings{Ma2019ResultsOT,
title={Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges},
author={Qingsong Ma and Johnny Wei and Ondrej Bojar and Yvette Graham},
booktitle={Conference on Machine Translation},
year={2019},
url={https://api.semanticscholar.org/CorpusID:201742578}
}
@inproceedings{Sellam2020BLEURTLR,
title={BLEURT: Learning Robust Metrics for Text Generation},
author={Thibault Sellam and Dipanjan Das and Ankur P. Parikh},
booktitle={Annual Meeting of the Association for Computational Linguistics},
year={2020},
url={https://api.semanticscholar.org/CorpusID:215548699}
}
BLEURT says it isn't good, and cites studies!
... but they seem to be for "generation", not translation specifically.
Also: ah, there we go. Papers cited in BLEURT for ROUGE:
For dialogue:
@article{Liu2016HowNT,
title={How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation},
author={Chia-Wei Liu and Ryan Lowe and Iulian Serban and Michael Noseworthy and Laurent Charlin and Joelle Pineau},
journal={ArXiv},
year={2016},
volume={abs/1603.08023},
url={https://api.semanticscholar.org/CorpusID:9197196}
}
"natural language evaluation"
@article{Chaganty2018ThePO,
title={The price of debiasing automatic metrics in natural language evaluation},
author={Arun Tejasvi Chaganty and Stephen Mussmann and Percy Liang},
journal={ArXiv},
year={2018},
volume={abs/1807.02202},
url={https://api.semanticscholar.org/CorpusID:49568810}
}
It is actually mentioned in WMT 2018
@inproceedings{Ma2018ResultsOT,
title={Results of the WMT18 Metrics Shared Task: Both characters and embeddings achieve good performance},
author={Qingsong Ma and Ondrej Bojar and Yvette Graham},
booktitle={Conference on Machine Translation},
year={2018},
url={https://api.semanticscholar.org/CorpusID:53246643}
}
... in passing
@inproceedings{Novikova2017WhyWN,
title={Why We Need New Evaluation Metrics for NLG},
author={Jekaterina Novikova and Ondrej Dusek and Amanda Cercas Curry and Verena Rieser},
booktitle={Conference on Empirical Methods in Natural Language Processing},
year={2017},
url={https://api.semanticscholar.org/CorpusID:1929239}
}
COMET paper also leaves out ROUGE:
@article{Rei2020COMETAN,
title={COMET: A Neural Framework for MT Evaluation},
author={Ricardo Rei and Craig Alan Stewart and Ana C. Farinha and Alon Lavie},
journal={ArXiv},
year={2020},
volume={abs/2009.09025},
url={https://api.semanticscholar.org/CorpusID:221819581}
}
Here's a survey on MT Eval which does mention it:
@article{Mondal2023MachineTA,
title={Machine translation and its evaluation: a study},
author={Subrota Kumar Mondal and Haoxi Zhang and H M Dipu Kabir and Kan Ni and Hongning Dai},
journal={Artificial Intelligence Review},
year={2023},
volume={56},
pages={10137-10226},
url={https://api.semanticscholar.org/CorpusID:257177987}
}
Aha! OK, this one comes to another conclusion.
On the other hand, the only thing they cite is the original paper. So this is not a new correlation study; they're just citing ROUGE itself, and ROUGE itself claims to be good for MT.
No wait, I read it wrong; I should look at tables 26 and 27.
What I cannot seem to figure out is where those scores are coming from. It's quite unclear to me.
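(For my own sanity: the "correlation" these tables report is typically just Pearson's r between metric scores and human judgments, per system or per segment. A toy sketch with fabricated numbers, only to pin down the computation:)

```python
# Metric validation a la the WMT metrics tasks: correlate metric scores
# with human judgments. All numbers below are made up for illustration.
from scipy.stats import pearsonr

human_scores = [72.1, 65.4, 80.3, 58.9, 69.0]    # e.g. direct assessment, one per system
metric_scores = [0.41, 0.36, 0.52, 0.30, 0.44]   # e.g. ROUGE-L F, one per system

r, p_value = pearsonr(human_scores, metric_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```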
This one does provide some support for ROUGE-L as an MT metric:
@inproceedings{Lin2004ORANGEAM,
title={ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation},
author={Chin-Yew Lin and Franz Josef Och},
booktitle={International Conference on Computational Linguistics},
year={2004},
url={https://api.semanticscholar.org/CorpusID:7139779}
}
OK, a new tack: trace back through SLT (sign language translation) papers, see if I can find who decided to use ROUGE and why.
They also don't compare with previous work, I think?
Oh here's the shared task of course: https://www.semanticscholar.org/paper/Findings-of-the-Second-WMT-Shared-Task-on-Sign-M%C3%BCller-Alikhani/604cc00ff6a4d57d97856b49be4df89452cf30a8
Maybe I can filter for papers that use RWTH-PHOENIX-WEATHER? That dataset predates the 2018 "Neural SLT" paper.
Also, looking at their baselines, it's BLEU and TER for them.
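(Both of those baselines are a one-liner with sacrebleu, for reference; the sentence pair is invented:)

```python
# BLEU and TER, the two baseline metrics that paper reports, via sacrebleu.
# Requires: pip install sacrebleu
from sacrebleu.metrics import BLEU, TER

hypotheses = ["tomorrow it will be sunny in the north"]
references = [["in the north it will be sunny tomorrow"]]  # one reference stream

print(BLEU().corpus_score(hypotheses, references))
print(TER().corpus_score(hypotheses, references))
```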
Searching the ROUGE citations for "sign" in the range between 2004 and 2018 (44 citations in that range): https://www.semanticscholar.org/paper/ROUGE%3A-A-Package-for-Automatic-Evaluation-of-Lin/60b05f32c32519a809f21642ef1eb3eaf3848008?year%5B0%5D=2003&year%5B1%5D=2018&sort=relevance&page=2&queryString=sign
Scrolled through and found ONLY two sign language papers. The second one cites the first, again just saying that ROUGE is commonly used.
Within "evaluation metrics", talk about how ROUGE is not really intended for machine translation, and the pitfalls thereof.
https://stats.stackexchange.com/questions/301626/interpreting-rouge-scores
https://towardsdatascience.com/to-rouge-or-not-to-rouge-6a5f3552ea45
https://hyperskill.org/learn/step/29669
https://en.wikipedia.org/wiki/ROUGE_(metric)
https://aclanthology.org/P04-1077/ "Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics" is the one that introduces the ROUGE-L variant; this one is actually aimed at MT.
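And since that's the paper that actually defines the variant, here's sentence-level ROUGE-L from scratch, following its LCS-based F-measure (naive whitespace tokenization and beta=1.0 are my simplifications; a sketch for my own understanding, not the official scorer):

```python
# Sentence-level ROUGE-L per Lin & Och (2004): with LCS length L over
# reference X (length m) and hypothesis Y (length n),
#   R = L/m, P = L/n, F = (1 + beta^2) * R * P / (R + beta^2 * P).
def lcs_length(x, y):
    # Classic O(m*n) dynamic-programming longest-common-subsequence length.
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if xi == yj
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[-1][-1]

def rouge_l(reference, hypothesis, beta=1.0):
    # beta trades off recall vs. precision; 1.0 gives a balanced F-score.
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(hyp)
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)

print(rouge_l("the cat sat on the mat", "the cat is on the mat"))  # ~0.833
```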