marytts-it / marytts

MARY TTS -- an open-source, multilingual text-to-speech synthesis system written in pure java
http://mary.dfki.de

To check phonemization of words with apostrophe #33

Open ftesser opened 11 years ago

ftesser commented 11 years ago

Example: in

dell'sei

sei is erroneously phonemized using rules:

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="it">
<p>
<s>
<phrase>
<t g2p_method="rules" ph="d e ll - ' z e1 - i" pos="EA" pos_full="EAfs">
dell'sei
<syllable ph="d e ll">
<ph p="d"/>
<ph p="e"/>
<ph p="ll"/>
</syllable>
<syllable ph="z e1" stress="1">
<ph p="z"/>
<ph p="e1"/>
</syllable>
<syllable ph="i">
<ph p="i"/>
</syllable>
</t>
<boundary breakindex="5" tone="L-L%"/>
</phrase>
</s>
</p>
</maryxml>
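To spot such cases in bulk, the MaryXML output can be scanned for tokens whose g2p_method is "rules". The following is only a minimal sketch using the Python standard library; the helper name tokens_by_g2p_method is hypothetical and not part of MaryTTS, and the sample is a trimmed copy of the output above.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the MaryXML output shown above.
MARYXML_NS = {"m": "http://mary.dfki.de/2002/MaryXML"}


def tokens_by_g2p_method(maryxml_bytes, method="rules"):
    """Hypothetical helper: return (token text, ph) pairs for every <t>
    element whose g2p_method attribute equals `method`."""
    root = ET.fromstring(maryxml_bytes)
    hits = []
    for t in root.iterfind(".//m:t", MARYXML_NS):
        if t.get("g2p_method") == method:
            # The token text precedes the <syllable> children, so it is in t.text.
            hits.append(((t.text or "").strip(), t.get("ph")))
    return hits


# Trimmed copy of the MaryXML above (syllable details omitted).
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" version="0.5" xml:lang="it">
<p><s><phrase>
<t g2p_method="rules" ph="d e ll - ' z e1 - i" pos="EA">dell'sei</t>
</phrase></s></p>
</maryxml>"""

for text, ph in tokens_by_g2p_method(sample):
    print(text, "->", ph)
```

Tokens reported this way are only candidates for review: a word that falls back to the rules is not necessarily phonemized wrongly.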

In contrast, dell'azione seems to be phonemized correctly.

giuliopaci commented 11 years ago

This is a tricky issue: in Italian, an apostrophe at the end of a word marks an elided vowel. This often happens when the next word also starts with a vowel. In that case the tokenisation is correct. In this case it is not, basically because the input is not orthographically correct. The current tokenisation module splits the two tokens only when the first word ends with a consonant and the second word starts with a vowel. This avoids splitting foreign names (e.g., Ha'aretz) and expressions (e.g., don't), while correctly tokenising all orthographically correct Italian expressions. This issue cannot be solved without a lexicon lookup, and I do not like the idea of adding another lexicon lookup layer.
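The splitting rule described above can be sketched roughly as follows. This is an illustration only, not the actual tokenisation module: the function name, the vowel set, and the choice to keep the apostrophe attached to the first token are all assumptions.

```python
# Assumed vowel inventory (lowercase, including common accented forms).
VOWELS = set("aeiouàèéìíòóùú")


def split_on_apostrophe(token):
    """Hypothetical helper: split after an apostrophe only when it is
    preceded by a consonant and followed by a vowel."""
    parts = []
    start = 0
    for i, ch in enumerate(token):
        # Ignore apostrophes at the very start or end of the token.
        if ch != "'" or i == 0 or i == len(token) - 1:
            continue
        before = token[i - 1].lower()
        after = token[i + 1].lower()
        if before.isalpha() and before not in VOWELS and after in VOWELS:
            # Assumption: the apostrophe stays with the first token.
            parts.append(token[start:i + 1])
            start = i + 1
    parts.append(token[start:])
    return parts


for word in ["dell'azione", "dell'sei", "Ha'aretz", "don't"]:
    print(word, "->", split_on_apostrophe(word))
```

Under these assumptions only dell'azione is split; dell'sei, Ha'aretz and don't each stay a single token, which matches the behaviour reported in this issue.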

ftesser commented 11 years ago

OK, we can assert that the apostrophe is handled correctly when the text is written in correct Italian. We can close this issue.

giuliopaci commented 11 years ago

Thinking about it again, there are cases where two consonants separated by an apostrophe do imply a token boundary: when the second word is an acronym or a Roman numeral whose expansion begins with a vowel (e.g., l'NBA, l'RNA, l'XI, l'VIII, ...). These cases are not handled yet (neither tokenisation nor expansion).
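One possible check for these cases, without a lexicon lookup, is sketched below. Nothing here exists in the code base: the function names are hypothetical, the letter set relies on the Italian letter names (effe, acca, elle, emme, enne, erre, esse, ics, plus the vowels), and an all-uppercase word that parses as a Roman numeral is assumed to be read as a number.

```python
import re

ROMAN_RE = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
# Letters whose Italian name starts with a vowel sound.
VOWEL_INITIAL_LETTER_NAMES = set("AEIOUFHLMNRSX")


def roman_to_int(text):
    """Return the value of a well-formed Roman numeral, or None."""
    if not text or not ROMAN_RE.fullmatch(text):
        return None
    total = 0
    for ch, nxt in zip(text, text[1:] + " "):
        value = ROMAN_VALUES[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        total += -value if ROMAN_VALUES.get(nxt, 0) > value else value
    return total


def number_starts_with_vowel(n):
    """Numbers read as otto, undici, ottanta..., ottocento...
    Assumption: 1 is left out because uno/primo differ in their first sound."""
    return n in (8, 11) or 80 <= n <= 89 or 800 <= n <= 899


def should_split_before(second):
    """Hypothetical check: does an all-uppercase second word expand to a
    pronunciation that begins with a vowel?"""
    if not second.isalpha() or not second.isupper():
        return False
    n = roman_to_int(second)
    if n is not None:
        # Limitation: acronyms that look like Roman numerals (e.g. MI) are read as numbers.
        return number_starts_with_vowel(n)
    return second[0] in VOWEL_INITIAL_LETTER_NAMES


for word in ["NBA", "RNA", "XI", "VIII", "CD", "sei"]:
    print(word, "->", should_split_before(word))
```

With these assumptions NBA, RNA, XI and VIII would trigger a split after the apostrophe, while CD and sei would not. Expansion of the second token would still have to be handled separately.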