This is a repository with the data and code for the ACL 2019 paper "When a Good Translation is Wrong in Context: ..." and the EMNLP 2019 paper "Context-Aware Monolingual Repair for Neural Machine Translation"
Differences between test.dst of Context-aware dataset and Docrepair dataset #9
Dear authors,

Thank you for publishing your code and data. It is well organized and easy to follow. 👍
I trained a sentence-level Transformer on the context-agnostic training data and successfully reproduced the BLEU score (33.91 in the EMNLP 2019 paper) on the context-aware test set (after removing BPE and '_eos', lowercasing, and joining the 4 segments into one long sentence).
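For reference, the post-processing I applied before scoring can be sketched as follows. This is a minimal sketch; it assumes the standard subword-nmt "@@ " BPE continuation marker, which may differ from the markers in your released data.

```python
def postprocess(line: str) -> str:
    """Post-process one model output line before computing BLEU."""
    # Undo BPE: rejoin subword pieces (assumption: subword-nmt "@@ " marker).
    line = line.replace("@@ ", "")
    # Drop the '_eos' segment separators, lowercase, and rejoin the
    # 4 segments into one long sentence for scoring.
    line = line.replace("_eos", " ")
    return " ".join(line.lower().split())

print(postprocess("I don@@ 't know . _eos Really ? _eos Yes . _eos Okay ."))
# -> "i don't know . really ? yes . okay ."
```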
However, I found that "test.dst" in the Docrepair dataset differs from "test.ru" in the Context-aware dataset.
The first line of "test.dst" in the Docrepair dataset:
вчера ночью кто-то вломился в мой дом и украл эту урод `скую футболку . _eos да ... _eos я не верю в это . _eos она слишком свободная на мне , чувак .
The first line of "test.ru" in the Context-aware dataset:
Вчера ночью кто-то вломился в мой дом и украл эту уродскую футболку . _eos Да ... _eos Я не верю в это . _eos Она слишком свободная на мне , чувак .
Apart from the lowercasing, "test.dst" in the Docrepair dataset contains many " `" sequences splitting some tokens (e.g., "уродскую" in the first line).
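A quick way to check whether the two references differ only in casing and these separators (a minimal sketch; treating " `" as a removable split marker is my assumption):

```python
def normalize(line: str) -> str:
    """Remove the observed " `" split markers, lowercase, and squeeze spaces."""
    return " ".join(line.replace(" `", "").lower().split())

dst = "вчера ночью кто-то вломился в мой дом и украл эту урод `скую футболку ."
ru = "Вчера ночью кто-то вломился в мой дом и украл эту уродскую футболку ."
print(normalize(dst) == normalize(ru))  # -> True
```

If this holds for every line, the two files encode the same reference under different tokenizations; any line where it fails would indicate a genuinely different reference.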
I would like to know:

1. Which reference is correct?
2. Does the Docrepair dataset use a different tokenization from the Context-aware dataset?
Looking forward to your reply. :)