Closed: xixiddd closed this issue 5 years ago.
Hi Shamil Chollampatt. In the Model and Training Details section of your paper, you say that "Each of the source and target vocabularies consists of 30K most frequent BPE tokens from the source and target side of the parallel data, respectively." However, according to this line in the preprocessing script (i.e., training/preprocess.sh), it seems that you only use the target-side data to learn the BPE codes, and then apply them to both the source and target data.

The BPE model is trained with 30,000 merge operations on the target side of the training data, as in the line that you pointed to. The source and target vocabularies for the encoder-decoder model consist of the 30,000 most frequent subwords (i.e., BPE-segmented tokens) from the source and target sides of the parallel data, respectively (see line).
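For anyone else landing on this thread, here is a minimal sketch of the preprocessing described above, assuming the subword-nmt command-line tool; the file names (train.src, train.tgt, bpe.model) are placeholders for illustration, not the actual paths used in training/preprocess.sh.

```bash
# Learn 30,000 BPE merge operations on the TARGET side only.
subword-nmt learn-bpe -s 30000 < train.tgt > bpe.model

# Apply the same BPE model to both the source and target sides.
subword-nmt apply-bpe -c bpe.model < train.src > train.src.bpe
subword-nmt apply-bpe -c bpe.model < train.tgt > train.tgt.bpe

# The source and target vocabularies are then built separately,
# each from the 30K most frequent BPE tokens on its own side.
```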