How to do BPE segmentation on DocRepair dataset?

lena-voita / good-translation-wrong-in-context

This is a repository with the data and code for the ACL 2019 paper "When a Good Translation is Wrong in Context: ..." and the EMNLP 2019 paper "Context-Aware Monolingual Repair for Neural Machine Translation"

Code Version: ec8dc79

Dear authors,

As mentioned in the repository README, the 30M document-level dataset needs to be BPE-segmented before training.

However, I could not find any scripts or instructions for applying BPE to the DocRepair dataset. I have the following questions:

1. To learn the BPE rules, should I learn them from train.dst only? Or should I learn two separate sets of rules, one from train.src and one from train.dst, and apply each set to its corresponding file?
2. To apply the BPE rules, you mention that each sentence should be processed separately and that all _eos tokens should be kept whole. However, calling apply_bpe.py directly on train.src breaks its data format. Do you have a script for applying BPE to the formatted datasets?
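
For reference, here is a minimal sketch of what the second question is asking for. It is not the authors' script. It assumes that each line of train.src / train.dst holds several sentences separated by a whitespace-delimited `_eos` token, and it takes the actual segmenter as a callable. The names `apply_bpe_per_sentence` and `toy_segment` are hypothetical and exist only for illustration; `toy_segment` is a stand-in, not real BPE.

```python
from typing import Callable, List

EOS = "_eos"  # separator token named in the issue; whitespace-delimited is an assumption


def apply_bpe_per_sentence(line: str, segment: Callable[[str], str], eos: str = EOS) -> str:
    """Segment each sentence of a line on its own and leave every `eos` token untouched."""
    pieces: List[str] = []
    current: List[str] = []
    for token in line.split():
        if token == eos:
            # Close the sentence collected so far, then copy the separator verbatim.
            if current:
                pieces.append(segment(" ".join(current)))
                current = []
            pieces.append(eos)
        else:
            current.append(token)
    if current:
        pieces.append(segment(" ".join(current)))
    return " ".join(pieces)


def toy_segment(sentence: str) -> str:
    """Hypothetical stand-in for a BPE segmenter: splits words longer than 3 characters."""
    words = []
    for word in sentence.split():
        words.append(word[:3] + "@@ " + word[3:] if len(word) > 3 else word)
    return " ".join(words)


if __name__ == "__main__":
    print(apply_bpe_per_sentence("hello there _eos how are you", toy_segment))
```

With subword-nmt installed, `toy_segment` could presumably be replaced by the `process_line` method of a `subword_nmt.apply_bpe.BPE` object built from a learned codes file. Whether those codes should come from train.dst alone or from both files is exactly the open question above.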

How to do BPE segmentation on DocRepair dataset? #4