Osiris-Team closed 2 years ago
Closing this because of no response
Same issue here using:
tts --text "One minute, the hill was bright with sun, and the next it was deep in shadows, and the wind that had been merely cool was downright cold." --out_path <path-to-output-file>
but replace the third comma with a full stop, and the entire text is successfully rendered.
tts --text "One minute, the hill was bright with sun, and the next it was deep in shadows. And the wind that had been merely cool was downright cold." --out_path <path-to-output-file>
It certainly looks to be an issue with sentence length. Tested on Xubuntu 20.04 with Python 3.8.10.
Increase the value of "max_decoder_steps".
For example, I use the Tacotron2 model.
tts --text "Hello"
> tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
> vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
> Using model: Tacotron2
> Model's reduction rate `r` is set to: 1
> Vocoder Model: hifigan
> Generator Model: hifigan_generator
> Discriminator Model: hifigan_discriminator
The installed project can be found here. Debian 10.
/home/user/.local/lib/python3.7/site-packages/TTS
We need a config file.
/home/user/.local/lib/python3.7/site-packages/TTS/tts/configs/tacotron_config.py
Changing values.
max_decoder_steps: int = 500
to
max_decoder_steps: int = 10000
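The path differs between installs, so it may be easier to locate the file from Python than to search for it by hand. A minimal sketch, assuming the package is importable as TTS in the active environment and that the line has exactly the form shown above. The helper names find_tacotron_config and set_max_decoder_steps are made up for illustration and are not part of the TTS library. Keep in mind that this edits an installed package in place, so reinstalling or upgrading TTS will undo it.

```python
import importlib.util
import re
from pathlib import Path


def find_tacotron_config() -> Path:
    """Return the path of tacotron_config.py inside the installed TTS package."""
    spec = importlib.util.find_spec("TTS")
    if spec is None or spec.origin is None:
        raise RuntimeError("The TTS package was not found in this environment")
    # spec.origin points at TTS/__init__.py; the config lives below that folder.
    return Path(spec.origin).parent / "tts" / "configs" / "tacotron_config.py"


def set_max_decoder_steps(source: str, new_value: int) -> str:
    """Rewrite the 'max_decoder_steps: int = N' default in the given source text."""
    pattern = re.compile(r"^(\s*max_decoder_steps\s*:\s*int\s*=\s*)\d+", re.MULTILINE)
    updated, count = pattern.subn(lambda m: f"{m.group(1)}{new_value}", source)
    if count == 0:
        raise ValueError("No max_decoder_steps line found in the given source")
    return updated


# Usage (hypothetical, run inside the environment where TTS is installed):
#   path = find_tacotron_config()
#   path.write_text(set_max_decoder_steps(path.read_text(), 10000))
```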
Thanks. This kinda should be added to the installation steps.
Increase the value of "max_decoder_steps".
Thanks RonyMacfly, that works.
It also works in virtual environments. In my case the tacotron_config.py file was in
.venv/lib/python3.8/site-packages/TTS/tts/configs/
Okay, I've been playing around a lot with this. As of today my best output uses tts_models/zh-CN/baker/tacotron2-DDC-GS. It might appear that the CN indicates a Chinese source, but this is not important when building a model: it's abstracting anything like words or letters, it's creating a multidimensional topological manifold that embodies 'speech' at its essence... I am working with around 80 to 100 words. I use python language tool to clean the text and fix grammar (more on that later), then pass it in text chunks to a loop that synthesizes and saves each audio chunk, then concatenates the audio chunks into a single file to play.
Decoder steps, a setting within the TTS library configuration (tacotron), is deceptive. It feels like a larger value (10000) should allow longer text. However, if you look at the quality, it drops off dramatically after 500 to 800 steps; more steps do not make it better. Chunking the text, synthesizing each chunk, then concatenating and playing keeps the decoder fed nicely. The training data is on a bell curve: most of its 'experience' lies in maybe 80 to 120 words, and it 'knows' what that sounds like. If you ask it for 1 second, or 10 minutes, the model collapses completely; it does not recognize this and does a (poor) job of trying to synthesize.
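The chunk, synthesize, concatenate loop could look roughly like this. It is only a sketch under a few assumptions: the tts command is on the PATH and is called with the same --text and --out_path flags used elsewhere in this thread, every chunk is written as a plain PCM WAV with the same sample rate, and the 100 word limit is just a starting point taken from the range mentioned above. The function names are invented for this example.

```python
import re
import subprocess
import wave
from pathlib import Path
from typing import List


def chunk_text(text: str, max_words: int = 100) -> List[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def synthesize(chunk: str, out_path: Path) -> None:
    """Render one chunk by calling the tts command line tool."""
    subprocess.run(["tts", "--text", chunk, "--out_path", str(out_path)], check=True)


def concatenate_wavs(parts: List[Path], out_path: Path) -> None:
    """Join WAV files that share one format into a single file."""
    if not parts:
        raise ValueError("No audio chunks to concatenate")
    with wave.open(str(out_path), "wb") as out:
        for i, part in enumerate(parts):
            with wave.open(str(part), "rb") as src:
                if i == 0:
                    # Copy channels, sample width and rate from the first chunk.
                    out.setparams(src.getparams())
                out.writeframes(src.readframes(src.getnframes()))


def speak_long_text(text: str, work_dir: Path, out_path: Path) -> None:
    """Synthesize text chunk by chunk and write one combined WAV file."""
    parts = []
    for i, chunk in enumerate(chunk_text(text)):
        part = work_dir / f"chunk_{i:04d}.wav"
        synthesize(chunk, part)
        parts.append(part)
    concatenate_wavs(parts, out_path)
```

Splitting on sentence boundaries rather than at a fixed word count avoids cutting a sentence in half, which tends to be audible at the joins.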
It is also tripped up by text that is just wrong: if the sentence structure is poor, the model struggles to simulate it. I am experimenting with using an 'AI' to format the text in a style that best suits the TTS model to improve the output.
I am also looking at other improvements, such as creating a baseline version and then another version with my chosen TTS model, comparing the audio lengths and discarding the longest. This would pick out the occasional errors where something like '<hello =sorry i bokre theAI' tries to get spoken, with hilarious or fairly tragic results depending on your sense of humor...
Also, I can roughly guess how long the audio should be by counting words and comparing that to the audio output length, or feed the output into a speech-to-text model and get its grammar judged against the supplied text, then add a loop that lets an AI remix the text continuously until it agrees the output is good... Plus, I can get my Twitch channel 'infinifiction' to ask listeners to rate the best speech and use that to retrain a model specialized in the length and style of output I want for this specific task...
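The word count check is the easiest of these to try. A rough sketch, assuming a speaking rate of about 150 words per minute and a tolerance of 50 percent; both numbers are placeholders that would need to be measured for the model in use, and the function names are invented here.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def expected_duration_seconds(text: str, words_per_minute: float = 150.0) -> float:
    """Estimate how long the text should take to speak (assumed speaking rate)."""
    return len(text.split()) * 60.0 / words_per_minute


def looks_plausible(text: str, actual_seconds: float,
                    tolerance: float = 0.5,
                    words_per_minute: float = 150.0) -> bool:
    """True if the audio length is within the tolerance of the estimate."""
    expected = expected_duration_seconds(text, words_per_minute)
    if expected == 0:
        return actual_seconds == 0
    return abs(actual_seconds - expected) <= tolerance * expected
```

A chunk that fails the check, either cut off early or babbling on for too long, can then be synthesized again or flagged for a closer look.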
On a Ryzen 9 6900HX I get a realtime factor of around 0.3, so I can confidently generate around 60 seconds of audio in 30 seconds.
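For reference, taking the realtime factor to mean synthesis time divided by audio duration (the usual definition, assumed here), a factor of 0.3 puts 60 seconds of audio at about 18 seconds of compute, so 30 seconds leaves a comfortable margin. A small sketch of that arithmetic, with made-up function names:

```python
def synthesis_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Compute time needed for a given amount of audio."""
    return audio_seconds * realtime_factor


def audio_seconds_within(budget_seconds: float, realtime_factor: float) -> float:
    """Amount of audio that fits into a given compute budget."""
    return budget_seconds / realtime_factor


# With the figure quoted above:
needed = synthesis_seconds(60.0, 0.3)        # roughly 18 seconds of compute
possible = audio_seconds_within(30.0, 0.3)   # roughly 100 seconds of audio
```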
Steps to reproduce:
python -m pip install TTS
tts --text "Hello my name is Johanna, and today I want to talk a bit about AutoPlug. In short, AutoPlug is a feature-rich, modularized server manager, that automates the most tedious parts of your servers or networks maintenance." --out_path INSERT_ABSOLUTE_DIR_PATH_HERE\output.wav
Result: The output.wav file is around 10 seconds long and the reader stops talking around the middle of the text ("... server manager, that...").
System: Windows 10 x64 bit
Looks like it's related to the length of a sentence...