jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

Mispronouncing some words and 44.1 kHz audio #85

Open tuannvhust opened 2 years ago

tuannvhust commented 2 years ago
  1. Some people claim that mispronunciation is one of the noticeable disadvantages of the VITS model, and I have experienced the same problem. Does anybody know what causes the mispronunciation?
  2. I trained the model on a 44.1 kHz dataset. Because of the higher sampling rate, the synthesized speech seems to show noise more prominently. Can anybody give me some suggestions for this problem? (A sketch of the config fields I think are involved follows this list.)
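For reference, these are the config fields I believe have to change for 44.1 kHz, following the layout of `configs/ljs_base.json`. The values below are only my guesses and are untested; the key constraint is that the product of the decoder upsample rates must equal the hop length:

```python
# Rough sketch of 44.1 kHz overrides for a VITS config (values are guesses, not tested).
data_overrides = {
    "sampling_rate": 44100,
    "filter_length": 2048,   # STFT size scaled up with the sample rate
    "win_length": 2048,
    "hop_length": 512,       # must equal the product of the decoder upsample_rates
}
model_overrides = {
    "upsample_rates": [8, 8, 4, 2],          # 8 * 8 * 4 * 2 = 512 = hop_length
    "upsample_kernel_sizes": [16, 16, 8, 4],
}
assert data_overrides["hop_length"] == 8 * 8 * 4 * 2
```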
nikich340 commented 1 year ago

It can be an eSpeak phonemizer problem. You can edit the text preprocessing scripts so that they accept IPA phonemes directly and change them as you need; a rough sketch follows.
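Something like the cleaner below could work (a sketch only; the name `ipa_passthrough_cleaners` and the helper are illustrative, and it assumes every IPA symbol you feed in already exists in `text/symbols.py`):

```python
import re

_whitespace_re = re.compile(r"\s+")


def collapse_whitespace(text):
    # Squash runs of whitespace so spacing in hand-written IPA stays consistent.
    return re.sub(_whitespace_re, " ", text)


def ipa_passthrough_cleaners(text):
    """Accept pre-phonemized IPA strings directly, skipping the eSpeak backend."""
    text = text.lower()
    text = collapse_whitespace(text)
    return text
```

Then point `"text_cleaners"` in your config at the new cleaner and store pre-phonemized IPA transcripts in your filelists, so phonemization happens offline under your control.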

weixsong commented 1 year ago

Hi, I also ran into the mispronunciation issue when using Chinese phonemes as input; is there any update on this? The model trained on the LJSpeech dataset with IPA input does not seem to suffer from mispronunciation, or maybe, since English is not my mother tongue, I just cannot notice the mispronounced cases?