Open ftesser opened 11 years ago
This is a tricky issue: in Italian, an apostrophe at the end of a word marks the elision of a vowel. This often happens when the next word also starts with a vowel. In that case the tokenisation is correct. In this case it is not, basically because the text is not orthographically correct. The current tokenisation module splits the two tokens only when the first word ends with a consonant and the second word starts with a vowel. This avoids tokenising foreign names (e.g., Ha'aretz) and expressions (e.g., don't), while correctly tokenising all orthographically correct Italian expressions. It is impossible to solve this issue without a lexicon lookup, and I do not like the idea of adding another lexicon lookup layer.
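To make the rule concrete, here is a minimal sketch of the consonant + apostrophe + vowel check described above. It is not the module's actual code: the function name `split_on_apostrophe`, the vowel set, and the choice to keep the apostrophe attached to the first token are all assumptions made for illustration.

```python
# Hypothetical sketch of the rule: split only on consonant + ' + vowel.
# The vowel set below is an assumption (plain and accented Italian vowels).
VOWELS = set("aeiouàèéìíòóùú")

def split_on_apostrophe(token: str) -> list[str]:
    idx = token.find("'")
    # No apostrophe, or apostrophe at the very start or end: nothing to split.
    if idx <= 0 or idx == len(token) - 1:
        return [token]
    before = token[idx - 1].lower()
    after = token[idx + 1].lower()
    # The first word must end with a consonant, the second must start with a vowel.
    if before.isalpha() and before not in VOWELS and after in VOWELS:
        # Assumption: the apostrophe stays with the first token.
        return [token[: idx + 1], token[idx + 1 :]]
    return [token]

print(split_on_apostrophe("dell'azione"))
print(split_on_apostrophe("Ha'aretz"))
print(split_on_apostrophe("don't"))
```

With this rule `dell'azione` is split, while `Ha'aretz` (vowel before the apostrophe) and `don't` (consonant after it) are left as single tokens.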
OK, we can say that the apostrophe is handled correctly as long as the text is written in correct Italian. We can close this issue.
Thinking about it again, there are some cases where two consonants separated by an apostrophe do imply a token separation: when the second word is an acronym or a Roman numeral whose expansion is pronounced with an initial vowel (e.g., l'NBA, l'RNA, l'XI, l'VIII, ...). These cases are not handled yet (neither tokenisation nor expansion).
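One possible way to cover these cases without a full lexicon is sketched below. Everything in it is an assumption for illustration, not existing code: the helper names, the set of letters whose Italian names begin with a vowel (effe, acca, elle, emme, enne, erre, esse, ics, ipsilon, plus the vowels themselves), and the decision to consider only the cardinal reading of Roman numerals (uno, otto, undici, ottanta..., ottocento...). It also assumes acronyms are spelled letter by letter.

```python
import re

# Assumption: letters whose Italian names start with a vowel sound.
VOWEL_INITIAL_LETTERS = set("AEFHILMNORSUXY")
ROMAN_RE = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(s: str):
    """Return the value of a well-formed Roman numeral, or None otherwise."""
    if not s or not ROMAN_RE.fullmatch(s):
        return None
    total = 0
    for ch, nxt in zip(s, s[1:] + " "):
        v = ROMAN_VALUES[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        total += -v if ROMAN_VALUES.get(nxt, 0) > v else v
    return total

def cardinal_starts_with_vowel(n: int) -> bool:
    # uno, otto, undici, ottanta..ottantanove, ottocento..ottocentonovantanove
    return n in (1, 8, 11) or 80 <= n <= 89 or 800 <= n <= 899

def expands_to_vowel(word: str) -> bool:
    if not (word.isalpha() and word.isupper()):
        return False
    n = roman_to_int(word)
    if n is not None:
        return cardinal_starts_with_vowel(n)
    # Otherwise treat the word as an acronym spelled letter by letter.
    return word[0] in VOWEL_INITIAL_LETTERS

def split_before_acronym(token: str) -> list[str]:
    head, sep, tail = token.partition("'")
    if sep and head and head[-1].isalpha() and expands_to_vowel(tail):
        return [head + sep, tail]
    return [token]

for t in ["l'NBA", "l'RNA", "l'XI", "l'VIII", "l'DNA"]:
    print(t, split_before_acronym(t))
```

Strings that are both a valid Roman numeral and a plausible acronym are read as numerals here, which is a simplification; ordinal readings and acronyms pronounced as words would need separate handling.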
Example: in
sei
is erroneously phonemized using rules. On the contrary,
dell'azione
seems to be correctly phonemized.