mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Mozilla Public License 2.0
9.28k stars 1.24k forks

> Decoder stopped with `max_decoder_steps` 500 #734

Closed Osiris-Team closed 2 years ago

Osiris-Team commented 2 years ago

Steps to reproduce:

  1. Install TTS: `python -m pip install TTS`
  2. Run in a console: `tts --text "Hello my name is Johanna, and today I want to talk a bit about AutoPlug. In short, AutoPlug is a feature-rich, modularized server manager, that automates the most tedious parts of your servers or networks maintenance." --out_path INSERT_ABSOLUTE_DIR_PATH_HERE\output.wav`

Result: The output.wav file is around 10 seconds long and the reader stops talking around the middle of the text ("... server manager, that...").

System: Windows 10 x64

Looks like it's related to sentence length...

Osiris-Team commented 2 years ago

Closing this due to no response.

SteveDaulton commented 2 years ago

Same issue here using:

`tts --text "One minute, the hill was bright with sun, and the next it was deep in shadows, and the wind that had been merely cool was downright cold." --out_path <path-to-output-file>`

But if I replace the third comma with a full stop, the entire text is rendered successfully:

`tts --text "One minute, the hill was bright with sun, and the next it was deep in shadows. And the wind that had been merely cool was downright cold." --out_path <path-to-output-file>`

It certainly looks to be an issue with sentence length. Tested on Xubuntu 20.04 with Python 3.8.10.

RonyMacfly commented 2 years ago

Increase the value of "max_decoder_steps".

For example, I use the Tacotron2 model.

```
tts --text "Hello"
 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: Tacotron2
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: hifigan
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
```

On Debian 10, the installed package can be found at `/home/user/.local/lib/python3.7/site-packages/TTS`.

The config file to edit is `/home/user/.local/lib/python3.7/site-packages/TTS/tts/configs/tacotron_config.py`.

Change `max_decoder_steps: int = 500` to `max_decoder_steps: int = 10000`.

Osiris-Team commented 2 years ago

Thanks. This should probably be added to the installation steps.

SteveDaulton commented 2 years ago

> Increase the value of "max_decoder_steps".

Thanks RonyMacfly, that works.

It also works in virtual environments. In my case the `tacotron_config.py` file was in `.venv/lib/python3.8/site-packages/TTS/tts/configs/`.

tudorw commented 1 year ago

Okay, I've been playing around with this a lot. As of today my best output so far uses `tts_models/zh-CN/baker/tacotron2-DDC-GS`. While the `zh-CN` might suggest a Chinese source, that isn't important when building a model: it abstracts away anything like words or letters and instead learns a multidimensional topological manifold that embodies 'speech' at its essence. I am working with around 80 to 100 words at a time. I use Python's language tool to clean the text and fix grammar (more on that later), then pass it in text chunks to a loop that synthesizes and saves each audio chunk, then concatenates the audio chunks into a single file to play.

Decoder steps, a setting within the TTS library configuration (tacotron), is deceptive. It feels like a larger value (10000) should allow longer text, but if you look at the quality, it drops off dramatically after 500 to 800 steps; more steps does not make it better. So chunking the text, then synthesizing, concatenating, and playing keeps the decoder fed nicely. The training data is on a bell curve: most of the model's 'experience' lies in maybe 80 to 120 words, and it 'knows' what that sounds like. Ask it for 1 second, or 10 minutes, and the model collapses completely; it does not recognize this and does a (poor) job of trying to synthesize anyway.

It is also tripped up by plain wrong text: if the sentence structure is poor, the model struggles. I am experimenting with using an 'AI' to format the text in a style that best suits the TTS model, to improve the output.

I am also looking at other improvements, such as creating a baseline version and another version with my chosen TTS model, then comparing the audio lengths and discarding the longest. This would pick out the occasional errors where something like '<hello =sorry i bokre theAI' tries to get spoken, with hilarious or fairly tragic results depending on your sense of humor...

I can also roughly guess how long the audio should be by counting words and comparing that to the output length, or feed the output into a speech-to-text model and have its grammar judged against the supplied text, with an additional loop that lets an AI remix the text continuously until it agrees the output is good... Plus, I can get my Twitch channel 'infinifiction' to ask listeners to rate the best speech and use that to retrain a model specialized in the length and style of output I want for this specific task...
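The word-count-versus-duration check mentioned above is easy to sketch. The ~150 words-per-minute speaking rate and the ±50% tolerance below are my own assumptions for illustration, not anything the model guarantees:

```python
def expected_seconds(text: str, words_per_minute: float = 150.0) -> float:
    """Rough duration estimate from word count at an assumed speaking rate."""
    return len(text.split()) / words_per_minute * 60.0


def duration_plausible(text: str, audio_seconds: float, tolerance: float = 0.5) -> bool:
    """Flag synthesized audio whose length is wildly off the word-count estimate."""
    estimate = expected_seconds(text)
    return abs(audio_seconds - estimate) <= tolerance * estimate
```

A chunk whose output fails `duration_plausible` would be a candidate for re-synthesis or for the text-remixing loop described above.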

tudorw commented 1 year ago

On a Ryzen 9 6900HX I get a real-time factor of around 0.3, so I can confidently generate around 60 seconds of audio in 30 seconds.