kellymarchisio closed this issue 3 years ago
@kellymarchisio Thanks for the suggestion.
I created a PR: https://github.com/lilt/alignment-scripts/pull/8. Somehow I only found non-breaking spaces in the Ro>En training data when running sed 's/\xC2\xA0/ /g' and diffing the output against the original file. If you have an En>Fr example that I missed, or feedback on the PR, give me a heads-up. Otherwise I'll merge it on Friday.
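For anyone who wants to check their own copy of the data, here is a rough sketch of how the affected files could be spotted without diffing. The helper name count_nbsp_lines is made up for illustration and is not part of the repository; it assumes the files are UTF-8, where the non-breaking space U+00A0 is encoded as the bytes 0xC2 0xA0.

```shell
#!/bin/sh
# UTF-8 encoding of U+00A0 (bytes 0xC2 0xA0), written as octal escapes
# because octal is the form POSIX printf guarantees.
nbsp=$(printf '\302\240')

# Hypothetical helper: print how many lines of a file contain a
# non-breaking space. LC_ALL=C makes grep compare raw bytes.
count_nbsp_lines() {
  # grep -c exits with status 1 when the count is 0; "|| true" keeps the
  # helper usable under "set -e" while still printing the count.
  LC_ALL=C grep -c -- "$nbsp" "$1" || true
}
```

Running count_nbsp_lines on each training file and looking for non-zero counts should give the same picture as the sed-and-diff approach.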
@thomasZen Sorry about that. En-Fr also broke, so I assumed it was the same issue. The problem there was actually caused by these lines in scripts/fast_align.sh:
paste -d "~" ${source_path} ${target_path} | sed 's/~/ ||| /g' > ${source_name}_${target_name}
paste -d "~" ${target_path} ${source_path} | sed 's/~/ ||| /g' > ${target_name}_${source_name}
The En-Fr training data contains a legitimate ~, so one line was converted to 1220 ||| 0.1 ||| 220 because ".1~220" is on the French side. That line ends up with two separators instead of one. I used this instead:
paste ${source_path} ${target_path} | sed -E 's/\t/ ||| /g' > ${source_name}_${target_name}
paste ${target_path} ${source_path} | sed -E 's/\t/ ||| /g' > ${target_name}_${source_name}
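The tab-based version still assumes that no training line contains a literal tab. A minimal sketch of a variant that checks this while joining is below. The function name join_parallel is hypothetical and not something in scripts/fast_align.sh. It uses awk rather than sed for the split, partly because I believe \t inside a sed pattern is not guaranteed to mean a tab outside GNU sed.

```shell
#!/bin/sh
# Hypothetical helper: join two parallel files into "source ||| target" lines.
# Usage: join_parallel SOURCE_FILE TARGET_FILE > OUTPUT_FILE
join_parallel() {
  # paste separates the two files with a tab by default; awk splits on that tab.
  paste "$1" "$2" | awk -F '\t' '
    # Anything other than 2 fields means an input line contained a tab itself.
    NF != 2 {
      print "line " NR ": expected 2 tab-separated fields, got " NF > "/dev/stderr"
      bad = 1
      next
    }
    { print $1 " ||| " $2 }
    # Exit non-zero if any line was skipped, so callers can stop the pipeline.
    END { exit bad }
  '
}
```

With this, a line such as ".1~220" passes through untouched, and a stray tab in either file produces a warning and a non-zero exit status instead of a silently malformed line.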
Oh, I missed that. I updated the fastAlign scripts in https://github.com/lilt/alignment-scripts/pull/8/commits/e13f3d688f1d3579d55e69f046d54b522eb1ae76.
If you happen to run the complete fastAlign or Giza pipeline again, feel free to update the results in a PR. And thanks for the improvements!
FYI, there are non-breaking space characters in some of the Ro-En and En-Fr training files, which causes the non-breaking space to be treated as a vocabulary word. For instance, we see this in train/roen.lc.plustest.src.vcb:
One offending file, for instance, is train/Romanian-English/training/Newspapers/2002.10.02.english.98880.0977.e:
I fixed this problem by piping the training sentences through sed 's/\xC2\xA0/ /g' (source: https://rmoff.net/2019/01/21/replacing-utf8-non-breaking-space-with-bash/sed-on-the-mac/). This should perhaps be added to your preprocessing pipeline, since I think the chances of the source published in 2003 being changed are low :)
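In case the \x escapes are not understood by a particular sed build, here is a sketch of the same replacement that passes the raw bytes to sed instead. The function name replace_nbsp is only illustrative, and this is one possible way to do it, not what the repository's preprocessing necessarily does.

```shell
#!/bin/sh
# UTF-8 encoding of U+00A0 (bytes 0xC2 0xA0) as octal escapes for POSIX printf.
nbsp=$(printf '\302\240')

# Hypothetical filter: read text on stdin and write it to stdout with every
# non-breaking space replaced by an ordinary space.
replace_nbsp() {
  # LC_ALL=C makes sed operate on bytes, so no escape support is needed
  # and invalid UTF-8 elsewhere in the input does not abort the run.
  LC_ALL=C sed "s/${nbsp}/ /g"
}
```

It is used as a filter, for example replace_nbsp < input_file > cleaned_file, where both file names are placeholders.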