lilt / alignment-scripts

Scripts to preprocess training and test data and to run fast_align and giza
MIT License

spm BPE part is broken due to the setting of --add_dummy_prefix #5

Zenglinxiao closed 5 years ago

Zenglinxiao commented 5 years ago

Because the script runs spm_train with --add_dummy_prefix 1, every line of the resulting subword file begins with '▁'. As a result, the word alignments converted from the subword alignments all start from index 1. The issue does not affect the word-level fast_align results, but it completely breaks fast_align with BPE tokenization.
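To illustrate the off-by-one, here is a minimal sketch that assumes the conversion derives each word index by counting '▁' markers along the subword sequence. The function name and the example sentence are hypothetical, and this is not necessarily how the repo's conversion script is written.

```python
from itertools import accumulate


def subword_to_word_index(pieces):
    """Map each subword position to a word index by counting '▁' markers.

    Hypothetical helper: assumes a new word starts at every piece that
    begins with '▁'.
    """
    return list(accumulate(int(p.startswith("▁")) for p in pieces))


# With the dummy prefix, the very first piece already carries '▁',
# so the running count starts at 1 rather than 0.
pieces = ["▁Das", "▁Haus", "tür", "▁ist"]
shifted = subword_to_word_index(pieces)

# Subtracting 1 brings the indices back to a 0-based word numbering.
zero_based = [i - 1 for i in shifted]

print(shifted)
print(zero_based)
```

If the evaluation expects 0-based word positions, every link produced from the shifted mapping points one word too far to the right, which would be consistent with the poor BPE scores below.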

Here are my results before the fix:

test.deen.bpe.word.grow-diagonal-final.talp: 64.4% (39.8%/32.0%/8127)
test.deen.bpe.word.grow-diagonal.talp: 63.8% (41.7%/31.7%/7647)
test.deen.bpe.word.intersection.talp: 65.7% (41.2%/29.2%/7127)
test.deen.bpe.word.reverse.talp: 65.9% (34.8%/33.4%/9700)
test.deen.bpe.word.talp: 65.7% (34.5%/34.2%/10009)
test.deen.bpe.word.union.talp: 65.9% (30.9%/38.4%/12582)
test.deen.grow-diagonal-final.talp: 27.7% (80.7%/65.5%/7964)
test.deen.grow-diagonal.talp: 27.0% (84.6%/64.1%/7418)
test.deen.intersection.talp: 28.0% (87.1%/61.3%/6863)
test.deen.reverse.talp: 32.0% (69.7%/66.4%/9414)
test.deen.talp: 28.4% (71.3%/71.8%/9932)
test.deen.union.talp: 31.8% (61.4%/77.0%/12483)

Here are my results (BPE part) after the fix (subtracting 1 from every index in the converted talp file):
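For reference, the index shift can be sketched roughly as below. It assumes each link in a talp line is a whitespace-separated token of the form source index, one separator character, target index (for example 3-4); the function name and the sample line are made up for illustration, and the merged PR may apply the correction differently.

```python
import re

# Assumed link shape: <source index><one non-digit separator><target index>.
LINK = re.compile(r"(\d+)(\D)(\d+)")


def shift_alignment_line(line, offset=-1):
    """Add `offset` to every index in one alignment line (hypothetical helper)."""
    shifted = []
    for token in line.split():
        match = LINK.fullmatch(token)
        if match is None:
            # Leave anything that does not look like a link untouched.
            shifted.append(token)
            continue
        src = int(match.group(1)) + offset
        tgt = int(match.group(3)) + offset
        if src < 0 or tgt < 0:
            raise ValueError(f"index would become negative in {token!r}")
        shifted.append(f"{src}{match.group(2)}{tgt}")
    return " ".join(shifted)


print(shift_alignment_line("1-1 2-3 3p2"))
```

Raising on a negative result is a deliberate guard: it would signal that the file had already been corrected or was never shifted in the first place.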

test.deen.bpe.word.talp.grow-diagonal-final.talp: 25.9% (81.7%/67.6%/8127)
test.deen.bpe.word.talp.grow-diagonal.talp: 25.2% (85.2%/66.5%/7647)
test.deen.bpe.word.talp.intersection.talp: 26.2% (87.3%/63.8%/7127)
test.deen.bpe.word.talp.union.talp: 31.1% (61.8%/78.1%/12582)
test.deen.bpe.word.reverse.talp.new: 30.4% (70.2%/68.9%/9700)

Reference from results/fastalign in the repo:

test.deen.bpe.word.grow-diagonal-final.talp: 27.0% (79.8%/67.2%/8270)
test.deen.bpe.word.grow-diagonal.talp: 26.4% (83.2%/65.9%/7755)
test.deen.bpe.word.intersection.talp: 27.2% (85.9%/63.1%/7171)
test.deen.bpe.word.reverse.talp: 30.9% (68.7%/69.4%/9985)
test.deen.bpe.word.talp: 29.8% (69.4%/71.1%/10099)
test.deen.bpe.word.union.talp: 32.7% (59.7%/77.4%/12913)
test.deen.grow-diagonal-final.talp: 27.7% (80.6%/65.5%/7964)
test.deen.grow-diagonal.talp: 27.0% (84.5%/64.1%/7421)
test.deen.intersection.talp: 28.0% (87.1%/61.3%/6863)
test.deen.reverse.talp: 32.0% (69.7%/66.4%/9421)
test.deen.talp: 28.5% (71.3%/71.8%/9930)
test.deen.union.talp: 31.8% (61.4%/77.0%/12488)

thomasZen commented 5 years ago

Thanks for your PR, it's merged now. If you are interested, you can also create a PR with the updated FastAlign results.