
Repository for DEMETR: Diagnosing Evaluation Metrics for Translation
MIT License

:ear_of_rice: DEMETR: Diagnosing Evaluation Metrics for Translation


This is the official repository for DEMETR, a dataset designed for diagnosing Machine Translation (MT) evaluation metrics (check out our paper for details). DEMETR consists of 35 perturbations spanning semantic, syntactic, and morphological error categories.

Some key features of DEMETR include:
:hibiscus: manually verified source texts, human translations, and machine translations (special attention was paid to avoiding translation artifacts such as translationese);
:hibiscus: carefully designed perturbations based on the MQM error annotation schema;
:hibiscus: 10 different source languages (:poland: Polish, :czech_republic: Czech, 🇷🇺 Russian, 🇩🇪 German, 🇫🇷 French, 🇮🇹 Italian, 🇪🇸 Spanish, 🇯🇵 Japanese, 🇨🇳 Chinese, :india: Hindi) to challenge reference-less MT evaluation metrics;
:hibiscus: manual implementation of, or manual checks on, the more challenging perturbations to ensure their plausibility.

Please contact us for the scripts used to generate the automatic perturbations, as well as the scripts used to compute the metric scores reported in the paper.

Details

The DEMETR dataset consists of 35 JSON files (one per perturbation), each of which contains 1,000 test items. Each test item includes the following fields: a unique id, the source sentence (src_sent), the human reference translation (eng_sent), the machine translation (mt_sent), the perturbed machine translation (pert_sent), the source language (lang_tag), the corpus the sentence was drawn from (data_source), whether the perturbation was manually checked (pert_check), the error severity (severity), and the perturbation's ID, description, and name (pert_id, pert_desc, pert_name).

Here is an example of one entry:

{
    "id": 18,
    "src_sent": "该声明称,土耳其还将接管对被捕的伊斯兰国武装分子的看守任务;欧洲国家拒绝将他们遣送回国。",
    "eng_sent": "Turkey would also take over the task of guarding captured ISIS fighters which, the statement said, European nations have refused to repatriate.",
    "mt_sent": "Turkey would also take over custody of captured Islamic State fighters which, the statement said, European countries have refused to send home.",
    "pert_sent": "Turkey would also take over custody of captured Islamic State fighters which, the statement said, European countries have agreed to send home.",
    "lang_tag": "chinese_simple",
    "data_source": "FLORES",
    "pert_check": true,
    "severity": "critical",
    "pert_id": 7,
    "pert_desc": "changing a word to its antonym (noun, adv, adj, verb)",
    "pert_name": "critical_id7_antonym"
  },

Coming Soon...

Despite all the work put into creating DEMETR, the dataset still has some limitations. One is that DEMETR is a sentence-level diagnostic dataset, and is therefore unable to evaluate metrics' sensitivity to discourse-level errors. Another is that DEMETR contains only translations into English, which limits its diagnostic capabilities to the into-English direction. We recognize these limitations and are working on a sister dataset. Stay tuned!

Citation

If you use DEMETR, please cite it as follows:

@inproceedings{demetr-2022,
  title = "DEMETR: Diagnosing Evaluation Metrics for Translation",
  author = "Marzena Karpinska and Nishant Raj and Katherine Thai and Yixiao Song and Ankita Gupta and Mohit Iyyer",
  booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
  year = "2022",
}