Because the script runs spm_train with --add_dummy_prefix 1, every line of the resulting subword file begins with '▁'.
As a result, the word alignments converted from the subword alignments all start from index 1.
This issue does not affect the results of word-level fastalign, but it completely breaks fastalign with BPE tokenization.
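The off-by-one correction can be sketched roughly as follows. This is only an illustration, not the repo's actual conversion code: it assumes the converted talp file holds whitespace-separated `i-j` index pairs on each line, and the helper name `shift_alignment_line` is hypothetical.

```python
def shift_alignment_line(line: str, offset: int = -1) -> str:
    """Shift every source-target index pair on one alignment line by `offset`.

    Assumes pairs look like "i-j" and are separated by whitespace
    (an assumption about the talp file, not confirmed by the repo).
    """
    shifted = []
    for pair in line.split():
        src, tgt = pair.split("-")
        shifted.append(f"{int(src) + offset}-{int(tgt) + offset}")
    return " ".join(shifted)


if __name__ == "__main__":
    # A line whose indices start from 1 is moved back to start from 0.
    print(shift_alignment_line("1-1 2-3 3-2"))
```

Applying this to each line of the converted file corresponds to the "-1 for all idx" fix; a cleaner alternative might be to correct the index where the subword-to-word conversion counts the leading '▁'.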
Here are my results before the fix:
Here are my results (BPE part) after the fix (subtracting 1 from every index in the converted talp file):
Reference results from results/fastalign in the repo: