joshua-decoder / thrax

Hadoop-based tool for extraction of large scale synchronous grammars for paraphrasing and machine translation
joshua-decoder.org

Validate alignments #7

Open mjpost opened 9 years ago

mjpost commented 9 years ago

If the alignments are invalid (say, because you concatenated them with an untokenized version of the input corpus), Thrax never tells you; it just barfs at some point down the road.

Thrax should do a validation pass to make sure all the alignments are sensible.
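As a rough illustration of what such a pass could check, here is a minimal Python sketch (not Thrax code; `alignment_errors` is a hypothetical helper). It assumes whitespace-tokenized source and target sentences and zero-based `i-j` alignment points, where `i` indexes the source and `j` the target.

```python
import re

# Hypothetical helper, not part of Thrax. Assumes alignment points look like
# "i-j" with zero-based indices: i into the source tokens, j into the target.
_POINT = re.compile(r"^(\d+)-(\d+)$")


def alignment_errors(source, target, alignment):
    """Return a list of problems found in one aligned sentence pair."""
    n_src = len(source.split())
    n_tgt = len(target.split())
    errors = []
    for point in alignment.split():
        match = _POINT.match(point)
        if match is None:
            errors.append(f"malformed alignment point {point!r}")
            continue
        i, j = int(match.group(1)), int(match.group(2))
        if i >= n_src:
            errors.append(
                f"source index {i} in {point!r} out of range "
                f"({n_src} source tokens)"
            )
        if j >= n_tgt:
            errors.append(
                f"target index {j} in {point!r} out of range "
                f"({n_tgt} target tokens)"
            )
    return errors
```

With a properly tokenized pair such as `alignment_errors("hola .", "hello .", "0-0 1-1")` the list comes back empty, while the untokenized source in `alignment_errors("hola.", "hello .", "0-0 1-1")` yields an out-of-range error for the point `1-1`, which is the kind of mismatch described above.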

mjpost commented 8 years ago

More generally, the input should be validated. You can pass a totally bogus file in as the Thrax input file and only learn about it deep in the pipeline. For example, the following input file will only cause cryptic errors:

¿ aló ? hello ? 1-0 0-1 2-1
hola .  hello . 0-0 1-1
¿ con quién hablo ?     with whom am i speaking ?       1-0 2-1 3-2 3-3 3-4 0-5 4-5
eh , silvia . sí , ¿ cómo se llama ?    eh , silvia , yes . what is your name ? 0-0 1-1 2-2 3-3 4-4 5-5 6-6 7-6 8-7 9-8 9-9 10-10
hola , silvia . eh , yo me llamo nicole .       hello silvia , eh , my name is nicole . 0-0 2-1 3-2 4-3 5-4 6-5 7-6 8-6 8-7 9-8 10-9
ah , mucho gusto .      ah , nice to meet you . 0-0 1-1 2-2 3-3 3-4 3-5 4-6
mucho gusto . em , ¿ y dónde está usted ?       nice to meet you . em , and where are you from ?    0-0 1-0 1-1 1-2 1-3 2-4 3-5 4-6 6-7 7-8 8-9 9-10 9-11 5-12 10-12
n- eh , yo estoy en filadelfia .        eh , i 'm in philadelphia .     0-0 1-0 2-1 3-2 4-3 5-4 6-5 7-6

(Note that the fields should be separated by |||s, not tabs.)
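A format check up front would catch this kind of file before the pipeline starts. The Python sketch below is only an illustration, not Thrax code: `input_errors` is a hypothetical helper, and it assumes each line should hold exactly three non-empty fields (source, target, alignment) separated by `|||`, as in the example above.

```python
# Hypothetical helper, not part of Thrax. Assumes each input line should have
# exactly three non-empty fields (source, target, alignment) split by "|||".
def input_errors(lines, delimiter="|||", expected_fields=3):
    """Return (line_number, message) pairs for lines that look malformed."""
    errors = []
    for number, line in enumerate(lines, start=1):
        fields = [field.strip() for field in line.rstrip("\n").split(delimiter)]
        if len(fields) != expected_fields:
            errors.append(
                (
                    number,
                    f"expected {expected_fields} fields separated by "
                    f"{delimiter!r}, found {len(fields)}",
                )
            )
        elif any(not field for field in fields):
            errors.append((number, "empty field"))
    return errors
```

A line like `hola . ||| hello . ||| 0-0 1-1` passes, whereas the same line with tabs between the fields is reported with its line number, so the user hears about the problem immediately instead of deep in the pipeline.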