sign-language-processing / sign-language-processing.github.io

Documentation and background of sign language processing

Discuss ROUGE section in eval metrics #87

Open cleong110 opened 1 week ago

cleong110 commented 1 week ago

Within "evaluation metrics", talk about how ROUGE is not really intended for machine translation, and the pitfalls thereof.
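To make the pitfall concrete, here is a minimal, illustrative sketch of ROUGE-1 recall (my own naming, not the official ROUGE package). Because ROUGE is recall-oriented over the reference, a hypothesis padded with extra words is not penalized at all, which is one reason it transfers poorly to MT:

```python
# Minimal ROUGE-1 recall sketch (illustrative only, not the official
# ROUGE toolkit). ROUGE asks how much of the REFERENCE the hypothesis
# covers, so extra hypothesis words cost nothing.
from collections import Counter

def rouge1_recall(reference: str, hypothesis: str) -> float:
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hypothesis.split())
    overlap = sum(min(count, hyp_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

ref = "the cat sat on the mat"
print(rouge1_recall(ref, "the cat sat on the mat"))            # -> 1.0
print(rouge1_recall(ref, "the cat sat on the mat and so on"))  # padded output, still 1.0
```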

https://stats.stackexchange.com/questions/301626/interpreting-rouge-scores

https://towardsdatascience.com/to-rouge-or-not-to-rouge-6a5f3552ea45

https://hyperskill.org/learn/step/29669

https://en.wikipedia.org/wiki/ROUGE_(metric)

https://aclanthology.org/P04-1077/ "Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics" is the one that introduces the ROUGE-L variation; this one is actually for MT.
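For reference, ROUGE-L as described there scores the longest common subsequence (LCS) between hypothesis and reference, so it rewards in-order matches without requiring them to be contiguous. A rough sketch (naming and details are mine, not the official toolkit):

```python
# Rough ROUGE-L sketch based on the LCS formulation in Lin & Och (2004);
# illustrative only, not the official implementation.
def lcs_len(a: list, b: list) -> int:
    # Classic O(len(a) * len(b)) dynamic programming table.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_len(ref, hyp)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(hyp)
    return 2 * precision * recall / (precision + recall)

# "the cat is on the mat" shares the 5-word subsequence "the cat on the mat"
print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))  # -> 5/6 ≈ 0.833
```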

cleong110 commented 1 week ago

https://www.semanticscholar.org/paper/A-survey-on-Sign-Language-machine-translation-N%C3%BA%C3%B1ez-Marcos-Perez-de-Vi%C3%B1aspre/f9c2127e745a73e34f93e2090bb385adbc74c775 talks about ROUGE

Wait, now that I read further, https://aclanthology.org/W04-1013.pdf (the original ROUGE paper) says

In a separate study (Lin and Och, 2004), ROUGE-L, W, and S were also shown to be very effective in automatic evaluation of machine translation. The stability and reliability of ROUGE at different sample sizes was reported by the author in (Lin, 2004).

cleong110 commented 1 week ago

Hmm so maybe the lesson is just "don't use plain ROUGE, use ROUGE-L"

cleong110 commented 1 week ago

MT Eval: A qualitative approach https://diposit.ub.edu/dspace/bitstream/2445/65906/1/ECP_PhD_THESIS.pdf

cleong110 commented 1 week ago

“Non-standard metrics ROUGE is a metric common in automatic summarization but not in MT, and was never correlated with human judgement in a large study. In eight out of 14 papers, BLEU is used with a non-standard maximum ngram order, producing variants such as BLEU-1, BLEU-2, etc. Similar to ROUGE, these variants of BLEU have never been validated as metrics of translation quality, and their use is scientifically unmotivated.”

https://aclanthology.org/2023.acl-short.60
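For context, the BLEU-1/BLEU-2 variants criticized above are just BLEU truncated at a lower maximum n-gram order. A hedged sketch of that mechanism (simplified single-reference BLEU; naming is mine, use a standard package like sacrebleu in practice):

```python
# Simplified single-reference BLEU sketch, to show how BLEU-1, BLEU-2, ...
# arise from the choice of maximum n-gram order; illustrative only.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference: str, hypothesis: str, max_n: int) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(clipped / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat"
hyp = "the cat on the mat"
print(bleu(ref, hyp, max_n=1))  # BLEU-1: unigram precision only
print(bleu(ref, hyp, max_n=2))  # BLEU-2: stricter, bigram order matters too
```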

cleong110 commented 1 week ago

https://ieeexplore.ieee.org/document/5071230 correlates it with human evaluation for summarization.

The correlation is "not great."

cleong110 commented 1 week ago

This one also looks at it for summarization: https://www.semanticscholar.org/paper/Looking-for-a-Few-Good-Metrics%3A-ROUGE-and-its-Lin/cfad182f567b6a6d54a4dd517e0122692afca9fb?sort=relevance&queryString=translation

cleong110 commented 1 week ago

Aha! Someone evaluates it on MT! https://www.semanticscholar.org/paper/Automatic-Meta-evaluation-of-Low-Resource-Machine-Yu-Liu/221803ac4bd1500098efffc29e6e20256e17d581

cleong110 commented 1 week ago

I'm searching through papers that cite ROUGE for "translation evaluation" https://www.semanticscholar.org/paper/ROUGE%3A-A-Package-for-Automatic-Evaluation-of-Lin/60b05f32c32519a809f21642ef1eb3eaf3848008?sort=relevance&queryString=translation%20evaluation

cleong110 commented 1 week ago
@inproceedings{Wubben2010ParaphraseGA,
  title={Paraphrase Generation as Monolingual Translation: Data and Evaluation},
  author={Sander Wubben and Antal van den Bosch and Emiel Krahmer},
  booktitle={International Conference on Natural Language Generation},
  year={2010},
  url={https://api.semanticscholar.org/CorpusID:11507867}
}


cleong110 commented 1 week ago
@inproceedings{Iskender2021DoesSE,
  title={Does Summary Evaluation Survive Translation to Other Languages?},
  author={Neslihan Iskender and Oleg V. Vasilyev and Tim Polzehl and John Bohannon and Sebastian Moller},
  booktitle={North American Chapter of the Association for Computational Linguistics},
  year={2021},
  url={https://api.semanticscholar.org/CorpusID:237532546}
}


cleong110 commented 1 week ago
@article{Mohtashami2023LearningTQ,
  title={Learning Translation Quality Evaluation on Low Resource Languages from Large Language Models},
  author={Amirkeivan Mohtashami and Mauro Verzetti and Paul K. Rubenstein},
  journal={ArXiv},
  year={2023},
  volume={abs/2302.03491},
  url={https://api.semanticscholar.org/CorpusID:256627762}
}

has this to say, but doesn't cite anything for correlation. [screenshot]

cleong110 commented 1 week ago

Conspicuously absent from WMT:

@inproceedings{Blain2023FindingsOT,
  title={Findings of the WMT 2023 Shared Task on Quality Estimation},
  author={Frederic Blain and Chrysoula Zerva and Ricardo Ribeiro and Nuno M. Guerreiro and Diptesh Kanojia and Jos{\'e} G. C. de Souza and Beatriz Silva and T{\^a}nia Vaz and Jingxuan Yan and Fatemeh Azadi and Constantin Orasan and Andr{\'e} Martins},
  booktitle={Conference on Machine Translation},
  year={2023},
  url={https://api.semanticscholar.org/CorpusID:265608057}
}

or 2019

@inproceedings{Ma2019ResultsOT,
  title={Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges},
  author={Qingsong Ma and Johnny Wei and Ondrej Bojar and Yvette Graham},
  booktitle={Conference on Machine Translation},
  year={2019},
  url={https://api.semanticscholar.org/CorpusID:201742578}
}
cleong110 commented 1 week ago
@inproceedings{Sellam2020BLEURTLR,
  title={BLEURT: Learning Robust Metrics for Text Generation},
  author={Thibault Sellam and Dipanjan Das and Ankur P. Parikh},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2020},
  url={https://api.semanticscholar.org/CorpusID:215548699}
}

BLEURT says it isn't good, and cites studies! [screenshot]

... but they seem to be for "generation", not translation specifically.

Also

[screenshot] Ah, there we go.

Papers cited in BLEURT for ROUGE

dialogue

@article{Liu2016HowNT,
  title={How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation},
  author={Chia-Wei Liu and Ryan Lowe and Iulian Serban and Michael Noseworthy and Laurent Charlin and Joelle Pineau},
  journal={ArXiv},
  year={2016},
  volume={abs/1603.08023},
  url={https://api.semanticscholar.org/CorpusID:9197196}
}

"natural language evaluation"

@article{Chaganty2018ThePO,
  title={The price of debiasing automatic metrics in natural language evaluation},
  author={Arun Tejasvi Chaganty and Stephen Mussmann and Percy Liang},
  journal={ArXiv},
  year={2018},
  volume={abs/1807.02202},
  url={https://api.semanticscholar.org/CorpusID:49568810}
}


cleong110 commented 1 week ago

It is actually mentioned in WMT 2018

@inproceedings{Ma2018ResultsOT,
  title={Results of the WMT18 Metrics Shared Task: Both characters and embeddings achieve good performance},
  author={Qingsong Ma and Ondrej Bojar and Yvette Graham},
  booktitle={Conference on Machine Translation},
  year={2018},
  url={https://api.semanticscholar.org/CorpusID:53246643}
}

... in passing. [screenshot]

cleong110 commented 1 week ago
@inproceedings{Novikova2017WhyWN,
  title={Why We Need New Evaluation Metrics for NLG},
  author={Jekaterina Novikova and Ondrej Dusek and Amanda Cercas Curry and Verena Rieser},
  booktitle={Conference on Empirical Methods in Natural Language Processing},
  year={2017},
  url={https://api.semanticscholar.org/CorpusID:1929239}
}


cleong110 commented 1 week ago

COMET paper also leaves out ROUGE:

@article{Rei2020COMETAN,
  title={COMET: A Neural Framework for MT Evaluation},
  author={Ricardo Rei and Craig Alan Stewart and Ana C. Farinha and Alon Lavie},
  journal={ArXiv},
  year={2020},
  volume={abs/2009.09025},
  url={https://api.semanticscholar.org/CorpusID:221819581}
}
cleong110 commented 1 week ago

Here's a survey on MT Eval which does mention it:

@article{Mondal2023MachineTA,
  title={Machine translation and its evaluation: a study},
  author={Subrota Kumar Mondal and Haoxi Zhang and H M Dipu Kabir and Kan Ni and Hongning Dai},
  journal={Artificial Intelligence Review},
  year={2023},
  volume={56},
  pages={10137-10226},
  url={https://api.semanticscholar.org/CorpusID:257177987}
}

Aha! [screenshots]

OK this one comes to another conclusion.

On the other hand, the only thing they cite is the original paper. So this is not a new correlation but a citation of ROUGE itself, and ROUGE itself claims to be good for MT.

No wait, I read it wrong; I should look at Tables 26 and 27. [screenshots]

What I cannot seem to figure out is where these scores are coming from. It's quite unclear to me.

cleong110 commented 1 week ago

This one does provide some support for ROUGE-L as an MT metric

@inproceedings{Lin2004ORANGEAM,
  title={ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation},
  author={Chin-Yew Lin and Franz Josef Och},
  booktitle={International Conference on Computational Linguistics},
  year={2004},
  url={https://api.semanticscholar.org/CorpusID:7139779}
}


cleong110 commented 1 week ago

OK, a new tack: trace back through SLT papers, see if I can find who decided to use ROUGE and why:

cleong110 commented 1 week ago

https://www.semanticscholar.org/paper/Neural-Sign-Language-Translation-Camg%C3%B6z-Hadfield/644602c65a5d8f30e62be027eb7b47f7c335191a

They also don't compare with previous works, I think?

cleong110 commented 1 week ago

Oh here's the shared task of course: https://www.semanticscholar.org/paper/Findings-of-the-Second-WMT-Shared-Task-on-Sign-M%C3%BCller-Alikhani/604cc00ff6a4d57d97856b49be4df89452cf30a8


https://www.semanticscholar.org/paper/Findings-of-the-First-WMT-Shared-Task-on-Sign-M%C3%BCller-Ebling/6140abf6acd1a3594a69c24edf1cbe448489e6ef

cleong110 commented 1 week ago

Maybe I can filter papers that use RWTH-PHOENIX-WEATHER? That predates the 2018 "Neural SLT" paper

https://www.semanticscholar.org/paper/RWTH-PHOENIX-Weather%3A-A-Large-Vocabulary-Sign-and-Forster-Schmidt/29228179df78b2bc28c0c65cea2f1a43132993c6

Also, looking at their baselines, it's BLEU and TER for them.

cleong110 commented 1 week ago

[screenshot] 44 citations in the range.

cleong110 commented 1 week ago

https://www.semanticscholar.org/paper/Sign-language-machine-translation-overkill-Stein-Schmidt/979f1e4c999c7cf5e757bbb1dadc95f016feacf8

Uses BLEU and TER

cleong110 commented 1 week ago

https://www.semanticscholar.org/paper/Building-a-sign-language-corpus-for-use-in-machine-Morrissey-Somers/1727fc2394b2829c40b472af01f2d55b2f3262ff http://doras.dcu.ie/16040/1/Building_a_Sign_Language_corpus_for_use_in_Machine_Translation.pdf

No scores

cleong110 commented 1 week ago

https://www.semanticscholar.org/search?year%5B0%5D=2004&year%5B1%5D=2018&q=sign%20language%20rouge&sort=relevance&page=2

The search doesn't turn up much.

cleong110 commented 1 week ago

Searching the ROUGE citations for "sign" in the range between 2004 and 2018: https://www.semanticscholar.org/paper/ROUGE%3A-A-Package-for-Automatic-Evaluation-of-Lin/60b05f32c32519a809f21642ef1eb3eaf3848008?year%5B0%5D=2003&year%5B1%5D=2018&sort=relevance&page=2&queryString=sign

Scrolled through and found ONLY these two sign language papers. [screenshot] The second one cites the first, again saying that ROUGE is commonly used.