jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

Issues with VITS: Mixed Voices and Missing Number Synthesis #212

Open iliuha93 opened 1 month ago


Hello,

I am using the VITS model for text-to-speech synthesis with a configuration that specifies a single voice. However, I am running into two issues:

  1. Sometimes part of the output is spoken in a male voice and part in a female voice, even though the configuration is set to use a single voice.
  2. The model does not synthesize numbers correctly.

Here is my current configuration:


{
  "_name_or_path": "facebook/mms-tts-deu",
  "activation_dropout": 0.1,
  "architectures": ["VitsModel"]
}
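
For the second issue, one workaround that may help is to spell out digits as words before the text reaches the model. This assumes the problem comes from digits being missing from, or rare in, the character set the checkpoint was trained on, which the configuration above does not confirm. The sketch below is a minimal illustration for German, since the config points at `facebook/mms-tts-deu`. The helper names `number_to_german` and `spell_out_numbers` are hypothetical and not part of VITS or any library. It only covers non-negative integers below one million and does not handle decimals, dates, or ordinals.

```python
import re

# Word stems for 0..19; index 1 is "ein" so compounds like "einundzwanzig" work.
ONES = ["null", "ein", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
        "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
        "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = ["", "", "zwanzig", "dreißig", "vierzig", "fünfzig", "sechzig",
        "siebzig", "achtzig", "neunzig"]


def _below_1000(n: int) -> str:
    """Spell 0..999 using the stem "ein"; returns "" for 0."""
    words = ""
    if n >= 100:
        words += ONES[n // 100] + "hundert"
        n %= 100
    if n >= 20:
        tens, ones = divmod(n, 10)
        # German puts the unit before the ten: 21 -> "ein" + "und" + "zwanzig".
        words += (ONES[ones] + "und" if ones else "") + TENS[tens]
    elif n > 0:
        words += ONES[n]
    return words


def number_to_german(n: int) -> str:
    """Spell an integer in the range 0..999999 as a German word."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n == 0:
        return "null"
    thousands, rest = divmod(n, 1000)
    words = _below_1000(thousands) + "tausend" if thousands else ""
    words += _below_1000(rest)
    # A trailing one is "eins" (1, 101, 1001), not the stem "ein".
    if n % 100 == 1:
        words += "s"
    return words


def spell_out_numbers(text: str) -> str:
    """Replace each run of digits with its German spelling."""
    def repl(match: re.Match) -> str:
        value = int(match.group())
        # Leave numbers outside the supported range untouched.
        return number_to_german(value) if value < 1_000_000 else match.group()
    return re.sub(r"\d+", repl, text)


if __name__ == "__main__":
    print(spell_out_numbers("Ich habe 3 Katzen und 21 Hunde."))
    # prints: Ich habe drei Katzen und einundzwanzig Hunde.
```

The normalized string would then be passed to the tokenizer in place of the raw input. For production use, a maintained text-normalization package is likely a better choice than this hand-written helper.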