redpony / cdec

Decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms
http://cdec-decoder.org/
Apache License 2.0

tokenize-anything.sh on Italian #79

Open mdtux89 opened 9 years ago

mdtux89 commented 9 years ago

Hi, I just wanted to report an error that the tokenize-anything.sh script makes on Italian sentences: it doesn't split "C'è" ("There's").

nschneid commented 9 years ago

This also applies to other contractions that end in "'è", such as "dov'è" ("where is") and "com'è" ("how is").

nschneid commented 9 years ago

Examples of other contractions that should be split, but aren't:

l'uomo all'interno nell'obbligo

These involve articles. Before a vowel, the singular definite articles lo and la are spelled l'. Combined with the prepositions a, da, di, in, and su, this yields all', dall', dell', nell', and sull'. The feminine indefinite article una is realized as un' before a vowel.
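To make the expected behaviour concrete, here is a minimal Python sketch of the splitting rule described in this thread. It is not the code used by tokenize-anything.sh. The function name `split_elisions`, the prefix list, and the output convention (keeping the apostrophe on the first token, as in "l' uomo") are all assumptions made for illustration, and the prefix list only covers the forms mentioned above.

```python
import re

# Elided forms mentioned in this thread. Illustrative only, not exhaustive.
ELIDED_PREFIXES = ["l", "un", "c", "all", "dall", "dell", "nell", "sull"]

# A known prefix at a word boundary, then a straight or typographic
# apostrophe, then (lookahead) a word character.
_PREFIX_RE = re.compile(
    r"\b(" + "|".join(ELIDED_PREFIXES) + r")(['\u2019])(?=\w)",
    re.IGNORECASE,
)

# Any word elided before the verb form "è" (for example "dov'è").
_E_RE = re.compile(r"\b(\w+)(['\u2019])(?=è\b)", re.IGNORECASE)


def split_elisions(text: str) -> str:
    """Insert a space after the apostrophe of an elided form.

    Assumed convention: the apostrophe stays attached to the first token.
    """
    text = _PREFIX_RE.sub(r"\1\2 ", text)
    return _E_RE.sub(r"\1\2 ", text)


if __name__ == "__main__":
    print(split_elisions("C'è l'uomo all'interno"))
    # prints: C' è l' uomo all' interno
```

Words whose elided part is not in the list, such as "bell'uomo", are left untouched by this sketch, so a real fix would need a fuller list or a more general rule.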