Closed: Archonsh closed this issue 4 years ago
Dear Archonsh,
Yes, the 30m data is also from the OpenSubtitles dataset, and it overlaps with the target side of the 6m parallel instances.
Best regards, Lena.
Thank you for your reply!
May I ask what you mean by "overlaps with the target side"?
Do you mean that the 30m data for DocRepair is constructed purely from the Russian side of the 6m OpenSubtitles data? Or does the 30m data contain any data other than the 6m Russian sentences?
Sincerely, Archonsh
Yes, since the DocRepair model is a monolingual model, its training data contains only Russian sentences. It is much bigger than the 6m sentence pairs, so yes, it contains other data in addition to the target sides of the 6m parallel sentences.
Best regards, Lena.
Dear authors,
In the DocRepair paper, you mention that "the baseline MT system was trained on the data released by Voita et al. (2019)", with 6m instances from the OpenSubtitles2018 corpus (Lison et al., 2018) for English and Russian.
You then mention that the DocRepair model was trained on "30m groups of 4 consecutive sentences as our monolingual data".
Is the 30m data also from the OpenSubtitles2018 corpus? If so, is there any overlap between the 6m training data for the baseline MT system and the DocRepair training data?
Thank you very much!