Hi Sanxing, thank you for sharing this script!

I ran your `preprocess.py` (to clean empty lines; I did not run the whole `prepare.sh`) and then used `fast_align` to learn an alignment model on the parallel corpus. I found that the perplexity of the alignments given by this model is higher than what I get on the same parallel corpus preprocessed by another script, `wmt.py`. I guess this is because that script merges the blank lines. Could you possibly add this `merge blank lines` function to your script in the future? Thanks a lot!
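
For reference, `fast_align` expects one sentence pair per line in the form `source ||| target`, so a blank line on either side leaves a pair it cannot align. Below is a minimal sketch of building that input while skipping such pairs. The function name `to_fast_align_lines` and the sample sentences are hypothetical, and skipping is only one assumed way to handle blank lines; it is not the actual logic of `preprocess.py` or `wmt.py`.

```python
def to_fast_align_lines(src_lines, tgt_lines):
    """Pair up source/target sentences in fast_align's `src ||| tgt` format.

    Pairs where either side is blank are skipped. This is an assumption about
    what the cleaning step could do, not the behavior of any existing script.
    """
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # a blank side cannot be aligned, so drop the whole pair
        pairs.append(f"{src} ||| {tgt}")
    return pairs


# Hypothetical sample data: the second source line is blank.
example = to_fast_align_lines(
    ["das Haus", "", "ein Buch"],
    ["the house", "stray line", "a book"],
)
print("\n".join(example))
```

Dropping the pair on both sides at once keeps the two files line-aligned, which matters more for the alignment model than how the blank line itself is treated.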