Issues with VITS: Mixed Voices and Missing Number Synthesis

Hello,

I am using the VITS model for text-to-speech synthesis with a configuration that specifies using a single voice. However, I am encountering two issues:

1. Sometimes part of the output is spoken in a male voice and part in a female voice, even though the configuration is set to use a single voice.
2. The model does not synthesize numbers correctly.
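
For the second issue, one possible workaround is to spell numbers out as words before the text reaches the model. This is only a minimal sketch, and it assumes the cause is that digit characters are not in the tokenizer's vocabulary and are dropped during preprocessing, which I have not confirmed. The names `german_number` and `normalize_numbers` are hypothetical helpers written for this post, not part of VITS or any library, and the sketch covers plain cardinal integers only (no decimals, ordinals, dates or currency).

```python
import re

# German words for 0..19 and for the tens 20..90.
UNITS = [
    "null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
    "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
    "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn",
]
TENS = ["", "", "zwanzig", "dreißig", "vierzig", "fünfzig",
        "sechzig", "siebzig", "achtzig", "neunzig"]


def _stem(word: str) -> str:
    """Turn a trailing 'eins' into 'ein' for use inside compounds."""
    return word[:-1] if word.endswith("eins") else word


def german_number(n: int) -> str:
    """Spell a cardinal integer in the range 0..999999 in German."""
    if not 0 <= n <= 999_999:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n < 20:
        return UNITS[n]
    if n < 100:
        tens, unit = divmod(n, 10)
        if unit == 0:
            return TENS[tens]
        # German puts the unit before the tens: 21 -> einundzwanzig.
        return _stem(UNITS[unit]) + "und" + TENS[tens]
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = german_number(rest) if rest else ""
        return _stem(UNITS[hundreds]) + "hundert" + tail
    thousands, rest = divmod(n, 1000)
    tail = german_number(rest) if rest else ""
    return _stem(german_number(thousands)) + "tausend" + tail


def normalize_numbers(text: str) -> str:
    """Replace every run of digits in the text with German words."""
    def spell(match: re.Match) -> str:
        digits = match.group()
        if len(digits) > 6:
            # Too long for german_number: read it digit by digit.
            return " ".join(UNITS[int(d)] for d in digits)
        return german_number(int(digits))

    return re.sub(r"\d+", spell, text)


print(normalize_numbers("Ich habe 3 Äpfel und 101 Birnen."))
```

The normalized string would then be passed to the tokenizer in place of the raw text. Leading zeros are discarded by `int()`, so "007" is read as "sieben".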

Here is my current configuration:


{
  "_name_or_path": "facebook/mms-tts-deu",
  "activation_dropout": 0.1,
  "architectures": ["VitsModel"]
}
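
To make the single-voice question easier to check, here is a small sketch that reports which voice-related keys the configuration sets explicitly and which fall back to defaults. The key names and default values are assumptions taken from my reading of the Transformers `VitsConfig` documentation and should be verified against the installed version; `voice_settings` is a hypothetical helper, not a library function.

```python
import json

# Assumed VitsConfig defaults (please verify against your Transformers version).
ASSUMED_DEFAULTS = {
    "num_speakers": 1,
    "speaker_embedding_size": 0,
    "noise_scale": 0.667,
    "noise_scale_duration": 0.8,
}

# The configuration excerpt from this post.
EXCERPT = """
{
  "_name_or_path": "facebook/mms-tts-deu",
  "activation_dropout": 0.1,
  "architectures": ["VitsModel"]
}
"""


def voice_settings(config_text: str) -> dict:
    """Map each voice-related key to (value, origin)."""
    config = json.loads(config_text)
    report = {}
    for key, default in ASSUMED_DEFAULTS.items():
        if key in config:
            report[key] = (config[key], "explicit")
        else:
            report[key] = (default, "assumed default")
    return report


for key, (value, origin) in voice_settings(EXCERPT).items():
    print(f"{key}: {value} ({origin})")
```

On the excerpt above, none of these keys is set explicitly, so the single-speaker behaviour would come from defaults rather than from anything shown in this configuration.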
