lilt / alignment-scripts

Scripts to preprocess training and test data and to run fast_align and giza
MIT License

Normalize non-breaking spaces for Romanian>English training data #8

Closed thomasZen closed 3 years ago

thomasZen commented 3 years ago

Fixes https://github.com/lilt/alignment-scripts/issues/7.
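For readers unfamiliar with the problem, here is a minimal sketch of what normalizing non-breaking spaces can look like. It is not the code from this PR: the function name `normalize_nbsp` and the choice of which code points to map (U+00A0 and U+202F) are assumptions made for illustration.

```python
# Hypothetical helper, not taken from the repository.
# Maps common non-breaking space code points to a plain ASCII space so that
# whitespace tokenization treats them like any other space.
_NBSP_TABLE = str.maketrans({
    "\u00a0": " ",  # NO-BREAK SPACE
    "\u202f": " ",  # NARROW NO-BREAK SPACE
})


def normalize_nbsp(line: str) -> str:
    """Return `line` with non-breaking spaces replaced by regular spaces."""
    return line.translate(_NBSP_TABLE)


if __name__ == "__main__":
    raw = "Bun\u0103\u00a0ziua"
    # Before normalization, splitting on a plain space sees a single token.
    print(raw.split(" "))
    # After normalization, the two words are separated as expected.
    print(normalize_nbsp(raw).split(" "))
```

Without this step, a token containing U+00A0 is treated as one word by tools that split only on the ASCII space, so the source and target sides can end up with mismatched token counts.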

Additionally, remove a duplicated command line argument to spm_train: --add_dummy_prefix was passed twice with inconsistent values. The later occurrence (--add_dummy_prefix 1) was the one that took effect, so I kept it. Also use a tab as the separator (instead of ~) when preparing the fastAlign data format.
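To illustrate why a tab is a safer separator than ~, here is a rough sketch of joining a source and a target line and then converting the pair to the `source ||| target` input format that fast_align reads. The helper names `join_pair` and `to_fast_align` are hypothetical and the scripts in this repository may do this differently (for example with shell tools).

```python
# Hypothetical helpers, not taken from the repository.

def join_pair(src: str, tgt: str, sep: str = "\t") -> str:
    """Join one source and one target sentence with `sep`.

    Raises ValueError if the separator already occurs in either sentence,
    because the pair could then not be split back unambiguously.
    """
    src, tgt = src.rstrip("\n"), tgt.rstrip("\n")
    if sep in src or sep in tgt:
        raise ValueError(f"separator {sep!r} occurs inside a sentence")
    return f"{src}{sep}{tgt}"


def to_fast_align(joined: str, sep: str = "\t") -> str:
    """Convert a `sep`-joined pair to the 'source ||| target' format."""
    src, found, tgt = joined.partition(sep)
    if not found:
        raise ValueError("separator not found in line")
    return f"{src.strip()} ||| {tgt.strip()}"


if __name__ == "__main__":
    pair = join_pair("Bun\u0103 ziua\n", "Good day\n")
    print(to_fast_align(pair))
    # A '~' can legitimately appear in running text, so it is ambiguous
    # as a separator; join_pair rejects it here instead of corrupting data.
    try:
        join_pair("cost ~5 lei", "costs ~5 lei", sep="~")
    except ValueError as err:
        print("rejected:", err)
```

A tab rarely appears inside a tokenized sentence, whereas ~ can occur in ordinary text, which is the motivation for the separator change.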