Archonsh closed this issue 4 years ago
You have to learn BPE rules separately for English and Russian, then apply these rules to the corresponding fragments in the formatted text.
For example, you can learn them on the context-agnostic data (the data used for the baseline): the train.src and train.dst files contain English and Russian sentences, respectively, and they come from the same OpenSubtitles dataset.
At least, this is the way I did it :)
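To illustrate the "apply to the corresponding fragments" step, here is a minimal sketch rather than the exact script used. It assumes that the sentences in a line are separated by a standalone `_eos` token. The helper names `apply_bpe_per_sentence` and `toy_segment` are hypothetical, and `toy_segment` only stands in for a real BPE segmenter so the example runs without extra dependencies.

```python
from typing import Callable


def apply_bpe_per_sentence(line: str,
                           segment: Callable[[str], str],
                           eos: str = "_eos") -> str:
    """Segment each sentence of a line separately, leaving `eos` tokens whole.

    `segment` is any function mapping one plain sentence to its BPE-segmented
    form. With subword-nmt this could be the `process_line` method of a `BPE`
    object loaded from the codes learned for that language (an assumption
    about your setup, not something this sketch depends on).
    """
    # Split the line into sentences at every standalone eos token.
    sentences, current = [], []
    for token in line.split():
        if token == eos:
            sentences.append(" ".join(current))
            current = []
        else:
            current.append(token)
    sentences.append(" ".join(current))

    # Re-join, segmenting only the sentences and never the separator.
    pieces = []
    for i, sentence in enumerate(sentences):
        if i > 0:
            pieces.append(eos)
        if sentence:
            pieces.append(segment(sentence))
    return " ".join(pieces)


def toy_segment(sentence: str) -> str:
    """Stand-in segmenter: splits every word longer than 3 characters once."""
    return " ".join(w if len(w) <= 3 else f"{w[:3]}@@ {w[3:]}"
                    for w in sentence.split())


if __name__ == "__main__":
    print(apply_bpe_per_sentence("hello there _eos how are you", toy_segment))
```

Because the separator never passes through the segmenter, it cannot be split into subword pieces, whichever BPE rules are plugged in.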
Code Version: ec8dc79
Dear authors,
As mentioned in the repository readme, the 30m document-level data needs to be BPE-ized for training.
However, I found no scripts or instructions on how to apply BPE to the DocRepair dataset. I have the following questions:
To learn the BPE rules, should I learn them from train.dst only? Or should I learn two sets of rules, one from train.src and one from train.dst, and apply each set to the corresponding file?
To apply the BPE rules, you mentioned that each sentence should be processed separately and that all _eos tokens should be kept whole. However, I found that calling apply_bpe.py directly on train.src messes up its data format. Do you have any scripts for applying BPE to the formatted datasets?