thuhcsi / VAENAR-TTS

The official implementation of VAENAR-TTS, a VAE-based non-autoregressive TTS model.

synthesized wavs of long texts #7

Open Liujingxiu23 opened 3 years ago

Liujingxiu23 commented 3 years ago

I downloaded the pretrained Databaker model and synthesized wavs with inference.py. The results are not very good: the alignment is wrong, especially when the input text is long. For example, for "失恋的人特别喜欢往人烟罕至的角落里钻。" (roughly, "People who have just gone through a breakup especially like to hide in deserted corners."), the synthesized wav sounds like: 失恋的人特别喜欢往人烟罕至的角角落里钻钻钻钻.

For longer input texts, the synthesized wavs are completely wrong.

light1726 commented 3 years ago

Hi @Liujingxiu23, thanks for the feedback. Attention errors can happen with VAENAR-TTS because no constraint is imposed on the attention alignment to make it monotonic; most of these errors are repeated phonemes. In my experience such cases are rare, and I have never seen the synthesized waveform come out completely wrong for a sentence.
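If it helps with debugging, here is a rough, hypothetical check (not code from this repo; the attention tensor and its `(decoder_frames, text_positions)` shape are assumptions) that flags decoder frames where the alignment path runs backwards, which is what usually surfaces as repeated syllables:

```python
import numpy as np

def find_backsteps(attn):
    """Hypothetical diagnostic: given an attention matrix of shape
    (decoder_frames, text_positions), return the argmax alignment path
    and the frame indices after which the path jumps backwards."""
    path = attn.argmax(axis=1)                  # most-attended text index per frame
    backsteps = np.where(np.diff(path) < 0)[0]  # frames where the path moves back
    return path, backsteps

# Toy example: frames 4 and 5 re-attend text position 1,
# i.e. the second syllable would be spoken again.
toy = np.eye(6, 4)[[0, 1, 2, 3, 1, 1]]
path, backsteps = find_backsteps(toy)
print(path.tolist())       # [0, 1, 2, 3, 1, 1]
print(backsteps.tolist())  # [3]
```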

Synthesis of long sentences is more challenging as there are not many long sentences in the training set.

I didn't do much parameter tuning on the Mandarin dataset. I think there are at least two things that could improve Mandarin TTS performance: 1) use phonemes as input, or split each Pinyin syllable into its consonant and vowel parts (initial and final), instead of treating the Pinyin as a plain character sequence as I do; 2) for out-of-dataset texts, predict prosodic boundaries so the input matches the form of the training transcriptions.
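As a rough illustration of point 1), a minimal sketch of splitting a Pinyin syllable into initial and final might look like the following. This is hypothetical and not the preprocessing used in this repo; the `_INITIALS` table and `split_pinyin` helper are made up for the example (note that "y"/"w" are sometimes treated as glides rather than true initials):

```python
# Standard Mandarin initials; two-letter initials must be checked first.
_INITIALS = [
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w",
]

def split_pinyin(syllable):
    """Split a (possibly tone-numbered) Pinyin syllable into (initial, final)."""
    for ini in _INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable  # zero-initial syllable, e.g. "an4"

print(split_pinyin("zhuan1"))  # ('zh', 'uan1')
print(split_pinyin("an4"))     # ('', 'an4')
```

Libraries such as pypinyin can also produce initial/final splits directly from Chinese text, which may be more convenient than a hand-rolled table.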

Liujingxiu23 commented 3 years ago

@light1726 Thank you very much for your reply. I will train on my Mandarin dataset with phone sequences and prosodic boundary information and see how it performs.