sanxing-chen / NMT2017-ZH-EN

Pre-processing and training scripts for WMT 2017 ZH-EN translation task

Lower performance in alignment compared to another preprocessing script. #5

Open haorannlp opened 3 years ago

haorannlp commented 3 years ago

Hi Sanxing, thank you for sharing this script!

I ran your preprocess.py (cleaning empty lines; I did not run the whole prepare.sh) and then used fast_align to learn an alignment model on the parallel corpus. I found that the perplexity of the alignments produced by this model is higher than what I get when the parallel corpus is preprocessed by another script, wmt.py. I guess this is because that script merges the blank lines. Could you possibly add this blank-line merging step to your script in the future? Thanks a lot!
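For concreteness, here is a minimal sketch of the comparison being run: pairing the two sides of the corpus into fast_align's `src ||| tgt` format (dropping pairs where either side is blank) and training a forward model. The file names (`train.zh`, `train.en`) are placeholders, not the actual outputs of preprocess.py or prepare.sh.

```python
#!/usr/bin/env python3
"""Sketch: build a fast_align input file and train a forward alignment model.
File names are assumptions, not what preprocess.py actually produces."""
import subprocess

SRC, TGT = "train.zh", "train.en"   # assumed tokenized parallel files
PAIRED = "train.zh-en"              # fast_align input: "src ||| tgt" per line

with open(SRC, encoding="utf-8") as fs, \
     open(TGT, encoding="utf-8") as ft, \
     open(PAIRED, "w", encoding="utf-8") as out:
    for zh, en in zip(fs, ft):
        zh, en = zh.strip(), en.strip()
        # Skip pairs where either side is blank; fast_align cannot use them.
        if not zh or not en:
            continue
        out.write(f"{zh} ||| {en}\n")

# Standard fast_align invocation (-d favor diagonal, -o optimize tension,
# -v variational Bayes); alignments go to stdout, per-iteration
# cross-entropy/perplexity is logged to stderr.
with open("forward.align", "w", encoding="utf-8") as aligned:
    subprocess.run(["fast_align", "-i", PAIRED, "-d", "-o", "-v"],
                   stdout=aligned, check=True)
```

The perplexity figure being compared in this issue is the one fast_align logs to stderr at each EM iteration.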