Do we really need to tokenize (word-segment) the text before feeding it into your library? As far as I can see, the phoneme string of every 2-, 3-, ...-syllable word in Vietnamese is just the concatenation of the phonemes of its constituent single-syllable words.
Example:
- cái: kaj˨˦
- gì: ɣi˧˨
- cái gì: kaj˨˦ɣi˧˨
I looped over the viet-n.tsv file and found no exceptions.
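For reference, here is a minimal sketch of the check I mean. It assumes viet-n.tsv has two tab-separated columns (word, phoneme) and that multi-syllable entries separate their syllables with spaces, e.g. `cái gì`; adjust if the actual layout differs:

```python
# Verify that every multi-syllable entry's phoneme string equals the
# concatenation of its syllables' phoneme strings.
import csv

lexicon = {}
with open("viet-n.tsv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        if len(row) == 2:  # assumed layout: word <TAB> phoneme
            lexicon[row[0]] = row[1]

exceptions = []
for word, phoneme in lexicon.items():
    syllables = word.split()
    if len(syllables) < 2:
        continue  # single-syllable entries are the base case
    if not all(s in lexicon for s in syllables):
        continue  # skip words whose syllables lack their own entries
    if "".join(lexicon[s] for s in syllables) != phoneme:
        exceptions.append(word)

print(f"{len(exceptions)} exceptions out of {len(lexicon)} entries")
```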
Logically, I think we shouldn't need a tokenizer here, since phonemes and syllables play the same role in the sentence.
Please let me know what you think about this.
There are two main reasons why we built text2phonemesequence at the word level:
1. We used the CharsiuG2P toolkit, which was trained at the word level to convert graphemes into phonemes. Therefore, to maintain the performance of the G2P toolkit across languages, we built text2phonemesequence at the word level as well (see the sketch after this list).
2. We believe that fine-tuning the TTS model with phonemes from a word-segmented sentence may improve the TTS system's prosody and naturalness.
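For context, here is a minimal sketch of how a CharsiuG2P checkpoint is queried one word at a time through Hugging Face transformers. The checkpoint name and the `<vie-n>: word` prompt format follow my reading of the CharsiuG2P repo, so treat them as assumptions rather than the exact setup inside text2phonemesequence:

```python
# Word-level G2P with a CharsiuG2P ByT5 checkpoint (assumed checkpoint
# name and "<lang>: word" prompt format; check the CharsiuG2P repo).
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained(
    "charsiu/g2p_multilingual_byT5_small_100"
)

# Each word is a separate prompt: this is the word-level design that
# text2phonemesequence builds on.
words = ["cái", "gì", "cái gì"]
prompts = [f"<vie-n>: {w}" for w in words]
inputs = tokenizer(prompts, padding=True, add_special_tokens=False,
                   return_tensors="pt")

preds = model.generate(**inputs, num_beams=1, max_length=50)
phonemes = tokenizer.batch_decode(preds, skip_special_tokens=True)
for word, phoneme in zip(words, phonemes):
    print(word, "->", phoneme)
```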
However, I also think that your idea makes sense. Perhaps we can compare the performance of both approaches when we have the time.
Thank you for your interest!