lilt / alignment-scripts

Scripts to preprocess training and test data and to run fast_align and giza
MIT License
108 stars 23 forks

fix sp2word, as we set --add_dummy_prefix 1 #4

Closed Zenglinxiao closed 4 years ago

Zenglinxiao commented 4 years ago

In the script ./preprocess/run.sh, we set the spm option --add_dummy_prefix 1 (also the default in spm_train), so every sentence in the resulting subword corpus begins with '▁'. As a result, the converted word alignments are counted from 1 instead of 0, which breaks the AER calculation for the subword part. To fix this, we can simply subtract 1 from the converted word indices to counteract the effect.
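
For illustration, here is a minimal sketch of the off-by-one and the fix. It is not the repository's actual code: the names `word_indices` and `to_word_alignment` are hypothetical, and it assumes that every word starts with a piece beginning with '▁' and that alignment links are 0-based (source, target) index pairs.

```python
from itertools import accumulate

WORD_START = "\u2581"  # '▁', the SentencePiece word-boundary marker


def word_indices(pieces):
    """Map each subword position to the 0-based index of its word.

    Hypothetical helper: with a dummy prefix, the first piece already
    starts with '▁', so the running count of markers begins at 1.
    Subtracting 1 brings the word indices back to 0-based.
    """
    counts = accumulate(int(p.startswith(WORD_START)) for p in pieces)
    return [c - 1 for c in counts]


def to_word_alignment(src_pieces, tgt_pieces, subword_links):
    """Convert 0-based subword links to deduplicated 0-based word links."""
    src_map = word_indices(src_pieces)
    tgt_map = word_indices(tgt_pieces)
    return sorted({(src_map[i], tgt_map[j]) for i, j in subword_links})


src = "▁the ▁align ment".split()
tgt = "▁die ▁Aus richt ung".split()
links = [(0, 0), (1, 1), (2, 2), (2, 3)]
print(to_word_alignment(src, tgt, links))  # [(0, 0), (1, 1)]
```

Without the `- 1`, the same input would yield word links starting at 1, which would not line up with 0-based reference alignments when computing AER.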