lena-voita / good-translation-wrong-in-context

This is a repository with the data and code for the ACL 2019 paper "When a Good Translation is Wrong in Context: ..." and the EMNLP 2019 paper "Context-Aware Monolingual Repair for Neural Machine Translation".

How to do BPE segmentation on DocRepair dataset? #4

Archonsh closed 4 years ago

Archonsh commented 4 years ago

Code Version: ec8dc79

Dear authors,

As mentioned in the repository README, the 30M document-level dataset needs to be BPE-segmented before training.

However, I found no scripts or instructions for applying BPE to the DocRepair dataset. I have the following questions:

  1. To learn the BPE rules, should I learn them from train.dst only? Or should I learn two sets of rules, one from train.src and one from train.dst, and apply each set to the corresponding file?

  2. To apply the BPE rules, you mentioned that each sentence should be processed separately and that all _eos tokens should be kept whole. However, calling apply_bpe.py directly on train.src breaks the data format. Do you have any scripts for applying BPE to the formatted datasets?

lena-voita commented 4 years ago

You have to learn BPE rules separately for English and Russian, then apply these rules to the corresponding fragments in the formatted text.

For example, you can learn them on context-agnostic data (the one for the baseline): train.src and train.dst files contain English and Russian sentences respectively; these come from the same OpenSubtitles dataset.

At least, this is the way I did it :)
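
For readers looking for a starting point, here is a minimal sketch of the per-sentence application discussed in this thread. It is not the authors' script. It assumes that each line of the formatted file holds several sentences separated by a whitespace-delimited `_eos` token, and it takes the BPE segmenter as a plain function argument. The names `apply_bpe_per_sentence` and `toy_segment` are hypothetical; `toy_segment` is only a stand-in for a real segmenter built from rules learned on the matching language.

```python
from typing import Callable, List

EOS = "_eos"  # separator token named in the question; exact file layout is assumed


def apply_bpe_per_sentence(line: str,
                           segment: Callable[[str], str],
                           eos: str = EOS) -> str:
    """Segment each sentence of a line separately, leaving eos tokens untouched."""
    sentences: List[str] = []
    current: List[str] = []
    for token in line.split():
        if token == eos:
            # Sentence boundary: close the current sentence, never segment eos itself.
            sentences.append(" ".join(current))
            current = []
        else:
            current.append(token)
    sentences.append(" ".join(current))
    return f" {eos} ".join(segment(s) for s in sentences)


def toy_segment(sentence: str) -> str:
    """Stand-in for a real BPE segmenter: splits words longer than 4 characters.

    This is NOT real BPE; replace it with a function that applies learned rules.
    """
    pieces = []
    for word in sentence.split():
        if len(word) > 4:
            pieces.append(word[:4] + "@@ " + word[4:])
        else:
            pieces.append(word)
    return " ".join(pieces)


if __name__ == "__main__":
    print(apply_bpe_per_sentence("hello world _eos good morning", toy_segment))
```

In practice, `segment` would be a function that applies the rules learned for the language of the file being processed, so that English and Russian fragments each go through their own rule set, as suggested in the reply above.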