lena-voita / good-translation-wrong-in-context

This is a repository with the data and code for the ACL 2019 paper "When a Good Translation is Wrong in Context: ..." and the EMNLP 2019 paper "Context-Aware Monolingual Repair for Neural Machine Translation"

Differences between test.dst of Context-aware dataset and Docrepair dataset #9

Open xc-kiwiberry opened 4 years ago

xc-kiwiberry commented 4 years ago

Dear authors,

Thank you for publishing your code and data. They are well organized and easy to follow. 👍

I have trained a sentence-level Transformer on the context-agnostic training data and successfully reproduced the BLEU score (33.91 in the EMNLP 2019 paper) on the context-aware test set (after removing BPE and '_eos', lowercasing, and treating the 4 segments as one long sentence).
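For reference, here is the postprocessing I applied before scoring — just a sketch; the `" `"` subword marker is my assumption based on the example lines below, not something taken from the repo's scripts:

```python
def postprocess(line: str) -> str:
    """Prepare a line for BLEU scoring: undo BPE splits, drop the
    segment separator token, and lowercase.

    Assumptions: subword pieces are joined by removing the " `" marker,
    and "_eos" separates the 4 segments of a group.
    """
    line = line.replace(" `", "")    # join BPE pieces back into words
    line = line.replace(" _eos", "")  # keep the 4 segments as one long sentence
    return line.lower().strip()
```
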

However, I found that "test.dst" in the Docrepair dataset is different from "test.ru" in the context-aware dataset.

The first line in "test.dst" in Docrepair dataset:

вчера ночью кто-то вломился в мой дом и украл эту урод `скую футболку . _eos да ... _eos я не верю в это . _eos она слишком свободная на мне , чувак .

The first line in "test.ru" in the context-aware dataset:

Вчера ночью кто-то вломился в мой дом и украл эту уродскую футболку . _eos Да ... _eos Я не верю в это . _eos Она слишком свободная на мне , чувак .

Apart from lowercasing, "test.dst" in the Docrepair dataset contains many " `" markers that split some tokens into subwords (e.g., "уродскую" in the first line).
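To check whether lowercasing and the " `" markers account for all of the differences, I compared the two files after a simple normalization — a sketch, assuming " `" is the only subword marker involved:

```python
def normalize(line: str) -> str:
    # Undo the " `" subword marker and lowercase; assumption: these are
    # the only intended differences between test.dst and test.ru.
    return line.replace(" `", "").lower().strip()

dst = "вчера ночью кто-то вломился в мой дом и украл эту урод `скую футболку ."
ru = "Вчера ночью кто-то вломился в мой дом и украл эту уродскую футболку ."
print(normalize(dst) == normalize(ru))
```
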

I would like to know:

  1. Which reference is correct?
  2. Does the Docrepair dataset use a different tokenization from the context-aware dataset?

Looking forward to your reply. :)