openmpf / openmpf-evaluation

Framework for generating metrics and evaluating OpenMPF component algorithms

Investigation: Datasets for Evaluation of Translation #6

Open hhuangMITRE opened 1 year ago

hhuangMITRE commented 1 year ago

Identify potential datasets of interest for evaluating the Azure Translation component.

Current datasets gathered:

WMT (Workshop on Statistical Machine Translation) – Contains a collection of machine translation tasks from the annual Conference on Machine Translation. Text sources include news, biomedical documents, and chats. We downloaded the WMT-19 data from Hugging Face (https://huggingface.co/datasets/wmt19).

Languages available (to English): Czech, Chinese, Estonian, Finnish, French, German, Gujarati, Hindi, Kazakh, Latvian, Lithuanian, Romanian, Russian, Turkish. Approx 3000 lines each.
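As a rough starting point, the WMT-19 pairs can be pulled directly from the Hugging Face `datasets` hub. A minimal sketch; the `de-en` config name and the record layout are assumptions based on the hub page, not something verified in our pipeline:

```python
from datasets import load_dataset

# Load the German-English pair of WMT-19 from the Hugging Face hub.
# Other pairs (e.g. "cs-en", "ru-en") follow the same config-name pattern.
wmt19_de_en = load_dataset("wmt19", "de-en", split="validation")

# Each record holds a {"translation": {"de": ..., "en": ...}} mapping;
# collect source/reference pairs for the evaluation harness.
sources = [row["translation"]["de"] for row in wmt19_de_en]
references = [row["translation"]["en"] for row in wmt19_de_en]
print(len(sources), sources[0], references[0])
```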

Open Subtitles 2018 (https://opus.nlpl.eu/OpenSubtitles-v2018.php) – A collection of multilingual movie transcripts.

Languages available (to English): Arabic, Chinese (Simplified, Traditional), French, Japanese, Korean, Persian, Russian, Spanish, Tagalog, Thai, Turkish, Urdu, Vietnamese

hhuangMITRE commented 1 year ago

For now, the WMT and Open Subtitles datasets contain sufficient translated text for a preliminary assessment.

hhuangMITRE commented 1 year ago

Howard’s notes:

CSLU-22 – Contains formatted pinyin for Chinese characters. It turns out both Olive and Vista return Unicode, so we will need a way to compare their output with the CSLU transcripts. Suggestion – use https://pypi.org/project/dragonmapper/ to convert the VISTA and Olive results to pinyin, and use a similar approach to convert the CSLU results. Since CSLU provides the intended vowel accent character, this can be done without an additional software package. Better – find an audio-to-text dataset that also contains Unicode transcriptions.
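A minimal sketch of the suggested dragonmapper conversion. The Olive/Vista result string here is a made-up placeholder, and the `hanzi.to_pinyin` call should be double-checked against the package docs:

```python
from dragonmapper import hanzi

# Placeholder Unicode output standing in for an Olive/VISTA transcription result.
olive_result = "我们明天去北京"

# dragonmapper converts Chinese characters to accented (tone-marked) pinyin,
# which should line up with the formatted pinyin in the CSLU-22 transcripts.
pinyin = hanzi.to_pinyin(olive_result)
print(pinyin)  # e.g. "wǒmenmíngtiānqùběijīng" (no word segmentation)
```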

Language Detection: https://www.kaggle.com/datasets/basilb2s/language-detection

Audio Transcription (w/ Native Languages):

- https://www.kaggle.com/datasets/bryanpark/russian-single-speaker-speech-dataset
- https://github.com/snakers4/open_stt/
- https://openslr.org/68/ - mobile phone data (Chinese)

Translation Datasets (Text to Text):

- https://www.kaggle.com/datasets/dhruvildave/en-fr-translation-dataset
- https://metatext.io/datasets-list/translation-task#:~:text=Dataset%20is%20a%20multilingual%20speech,number%20of%20speakers%20is%2078K.

Translation Datasets (Audio to Text):

Megane’s notes:

Metrics:

- Some metrics perform well on certain languages but are weak on others (language bias problem)
- Some metrics rely on many language features or other linguistic information, which makes it hard for other researchers to repeat experiments
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
  - Compares an automatically produced summary or translation against a reference or a set of references (human-produced summaries or translations)
  - Metrics:
    - ROUGE-N: overlap of n-grams
    - ROUGE-1: overlap of unigrams (each word)
    - ROUGE-2: overlap of bigrams
    - ROUGE-L: Longest Common Subsequence (LCS) based statistics
      - The longest common subsequence problem takes sentence-level structure similarity into account naturally and identifies the longest co-occurring in-sequence n-grams automatically
    - ROUGE-W: weighted LCS-based statistics that favor consecutive LCSes
    - ROUGE-S: skip-bigram based co-occurrence statistics; a skip-bigram is any pair of words in their sentence order
    - ROUGE-SU: skip-bigram plus unigram-based co-occurrence statistics
    - Compare against human-translated output
  - ROUGE does not account for different words that have the same meaning; it measures syntactic matches rather than semantics
- BLEU (bilingual evaluation understudy; see this one most often)
  - Compares machine translation to a human one
  - Score is between 0 and 1; a translation does not need to reach 1 to be good
    - Not even human translations would get 1 (or at least rarely would), since 1 means the output is identical to the reference translation
  - Uses a brevity penalty
  - Weighted geometric mean of all the modified n-gram precisions, multiplied by the brevity penalty
  - Frequently reported as correlating well with human judgement
  - Cannot deal with languages without word boundaries
  - It has been argued that although BLEU has significant advantages, there is no guarantee that an increase in BLEU score indicates improved translation quality
- NIST (based on BLEU)
  - BLEU calculates n-gram precision weighting each n-gram equally
  - NIST calculates how informative a particular n-gram is
    - When a correct n-gram is found, the rarer that n-gram is, the more weight it is given
  - Brevity penalty is less strict; small variations in translation length do not impact the overall score as much
- METEOR
  - Based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision
  - Has stemming and synonymy matching, along with the standard exact word matching
  - Designed to fix some of the problems found in BLEU, and also produces good correlation with human judgement at the sentence or segment level
    - BLEU seeks correlation at the corpus level
- LEPOR
  - Designed with the factors of enhanced length penalty, precision, n-gram word order penalty, and recall
  - Penalizes the machine translation if it is longer or shorter than the reference translation
  - hLEPOR investigates the integration of linguistic features, such as part of speech (POS)
    - e.g. if a token of the output sentence is a verb while a noun is expected, it is penalized
    - If the POS is the same but the exact word is not, e.g. "good" vs. "nice", it is not penalized
  - nLEPOR adds n-gram features
    - An n-gram based, language-independent MT evaluation metric employing modified sentence length penalty, position difference penalty, n-gram precision, and n-gram recall
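To make the metric notes concrete, here is a small sketch scoring candidate translations with sacrebleu (corpus BLEU) and rouge_score (ROUGE-1/ROUGE-L). The example sentences are made up, and the choice of these two packages is an assumption on my part; the issue does not prescribe a scoring library:

```python
import sacrebleu
from rouge_score import rouge_scorer

# One machine-translated hypothesis and one human reference per segment.
hypotheses = ["the cat sat on the mat", "he reads a book in the park"]
references = ["the cat is sitting on the mat", "he is reading a book in the park"]

# Corpus-level BLEU: geometric mean of modified n-gram precisions times the
# brevity penalty. sacrebleu takes a list of hypotheses and a list of
# reference streams (here a single reference per segment) and reports the
# score on a 0-100 scale.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# Sentence-level ROUGE: unigram overlap (ROUGE-1) and longest common
# subsequence (ROUGE-L) against the human reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for hyp, ref in zip(hypotheses, references):
    scores = scorer.score(ref, hyp)
    print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
```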

Datasets:

WMT

- Available on Hugging Face
- Can create custom datasets combining multiple years (see the sketch after the OpenSubtitles entry below)
- WMT 14:
  - Czech, German, French, Hindi, Russian
  - Training data:
    - Mainly taken from version 7 of the Europarl corpus
      - About 50 million words of training data per language
    - Additional training data is taken from the News Commentary corpus
      - About 3 million words (per language?)
    - 2013 Common Crawl corpus (web sources)
    - For English-Hindi, HindEnCorp is used
  - Test data: news
- WMT 15:
  - Czech, German, French, Finnish, Russian
  - Training data: same as 2014
- WMT 16:
  - Czech, German, Finnish, Romanian, Russian, Turkish
  - Training data: same as 2014
    - Romanian-English and Turkish-English use the SETIMES2 corpus
    - 2016 Common Crawl
- WMT 17:
  - Chinese, Czech, Finnish, German, Latvian, Russian, Turkish
  - Training data:
    - Mainly taken from public data sources such as the Europarl corpus and the UN corpus
    - Additional training data is taken from the News Commentary corpus, which is re-extracted every year for the task
- WMT 18:
  - Chinese, Czech, Estonian, Finnish, German, Kazakh, Russian, Turkish
  - Training data:
    - Same as 2017
    - Also includes ParaCrawl, a new web-crawled corpus for English to Czech, Estonian, Finnish, German, and Russian
      - As this is the first release, it is potentially noisy
- WMT 19:
  - Chinese, Czech, Finnish, German, Gujarati, Kazakh, Lithuanian, Russian
  - Training data: same as 2018
  - Test data:
    - Created from a sample of online newspapers from September-November 2018
    - For the established languages (i.e. English to/from Chinese, Czech, German, Finnish, and Russian) the English-X and X-English test sets are distinct and consist only of documents created originally in the source language
    - For the new languages (i.e. English to/from Gujarati, Kazakh, and Lithuanian) the test sets include 50% English-X translation and 50% X-English translation
    - In previous recent tasks, all the test data was created using the latter method

OpenSubtitles

- https://opus.nlpl.eu/OpenSubtitles-v2018.php
- Sourced from movie and TV subtitles
- Czech, Bulgarian, Greek, Spanish, Turkish, French, Polish, Russian, Arabic, Portuguese, Chinese
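A minimal sketch of combining splits from multiple WMT years into one custom evaluation set with the Hugging Face `datasets` library. The config names and the availability of the same language pair across the chosen years are assumptions:

```python
from datasets import load_dataset, concatenate_datasets

# Pull the German-English validation split from several WMT years and stack
# them into a single custom evaluation set. The "de-en" config name is
# assumed to be consistent across the wmt17/wmt18/wmt19 hub datasets.
years = ["wmt17", "wmt18", "wmt19"]
splits = [load_dataset(year, "de-en", split="validation") for year in years]
combined = concatenate_datasets(splits)

print(len(combined), "segments in the combined set")
print(combined[0]["translation"])  # {"de": ..., "en": ...}
```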

2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set

- Chinese, Arabic
- 149 documents with corresponding reference translations (Arabic-to-English and Chinese-to-English), system translations, and human assessments

MuST-C (used for IWSLT 2022)

- Multilingual speech translation corpus
- For training end-to-end systems for speech translation from English into several languages
- Audio has manual transcriptions included (can do text-to-text)
- Audio recordings from TED talks
- Several hundred hours of audio recordings which are automatically aligned at the sentence level with their manual transcriptions and translations
- v1.0:
  - English-to-{Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish}
- v1.2:
  - English-to-{Arabic, Chinese, Czech, Dutch, French, German, Italian, Persian, Portuguese, Romanian, Russian, Spanish, Turkish, Vietnamese}
  - (includes the 8 language directions of release v1.0)
- TED talks are copyrighted by TED Conference LLC and licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 license
  - MuST-C is released under the same Creative Commons Attribution-NonCommercial-NoDerivs 4.0 license
