Training data for DocRepair model

lena-voita / good-translation-wrong-in-context

This is a repository with the data and code for the ACL 2019 paper "When a Good Translation is Wrong in Context: ..." and the EMNLP 2019 paper "Context-Aware Monolingual Repair for Neural Machine Translation"


Training data for DocRepair model #6

Closed by Archonsh 4 years ago

Archonsh commented 4 years ago

Dear authors,

In the DocRepair paper, you mention that "the baseline MT system was trained on the data released by Voita et al. (2019)", with 6m instances from the OpenSubtitles2018 corpus (Lison et al., 2018) for English and Russian.

You then mention that the DocRepair model was trained on "30m groups of 4 consecutive sentences as our monolingual data".

Is the 30m data also from the OpenSubtitles2018 corpus? If so, is there any overlap between the 6m training data for the baseline MT system and the DocRepair training data?

Thank you very much!

lena-voita commented 4 years ago

Dear Archonsh,

Yes, the 30m data is also from the OpenSubtitles dataset, and it overlaps with the target side of the 6m parallel instances.

Best regards, Lena.

Archonsh commented 4 years ago

Thank you for your reply!

May I ask what you mean by "overlaps with the target side"?

Do you mean the 30m data for DocRepair is constructed purely from the Russian side of the 6m OpenSubtitles data? Does the 30m data contain any data other than the 6m Russian data?

Sincerely, Archonsh

lena-voita commented 4 years ago

Yes, since the DocRepair model is monolingual, its training data contains only Russian sentences. It is much larger than the 6m sentence pairs, so yes, it contains other data in addition to the target sides of the 6m parallel sentences.

Best regards, Lena.
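
For readers who want to check this kind of overlap on their own data, here is a minimal sketch of the two ideas discussed in this thread: cutting a monolingual document into groups of 4 consecutive sentences, and measuring how many of those sentences also occur on the target side of a parallel corpus. It is not the authors' preprocessing. The function names `make_groups` and `overlap_fraction` are hypothetical, the non-overlapping grouping (`stride=4`) is an assumption, and "overlap" is assumed here to mean exact string match.

```python
from typing import List, Set, Tuple


def make_groups(sentences: List[str], size: int = 4, stride: int = 4) -> List[Tuple[str, ...]]:
    """Cut one document (sentences in original order) into groups of
    `size` consecutive sentences. A trailing remainder shorter than
    `size` is dropped. stride == size gives non-overlapping groups
    (an assumption; the thread does not say how groups were cut)."""
    return [
        tuple(sentences[i:i + size])
        for i in range(0, len(sentences) - size + 1, stride)
    ]


def overlap_fraction(groups: List[Tuple[str, ...]], parallel_targets: Set[str]) -> float:
    """Fraction of sentences in `groups` that also appear, by exact
    string match, among the target-side sentences of the parallel data."""
    total = sum(len(group) for group in groups)
    if total == 0:
        return 0.0
    shared = sum(1 for group in groups for sent in group if sent in parallel_targets)
    return shared / total


# Toy data: placeholder strings stand in for Russian subtitle lines.
document = [f"s{i}" for i in range(1, 10)]   # 9 sentences: s1 .. s9
parallel_targets = {"s2", "s3", "s20"}       # target side of the parallel data

groups = make_groups(document)
fraction = overlap_fraction(groups, parallel_targets)
```

With the toy data above, `groups` holds two groups (`s1`..`s4` and `s5`..`s8`), and two of their eight sentences are found in `parallel_targets`. A fraction below 1.0 corresponds to the situation described in the reply: the monolingual data includes the target sides of the parallel sentences but also contains other sentences.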