mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Mozilla Public License 2.0

Tacotron (2?) based models appear to be limited to rather short input #739

Open deliciouslytyped opened 2 years ago

deliciouslytyped commented 2 years ago

Running tts --text on some meaningful sentences results in the following output:

$ tts --text "An important event is the scheduling that periodically raises or lowers the CPU priority for each process in the system based on that process’s recent CPU usage (see Section 4.4). The rescheduling calculation is done once per second. The scheduler is started at boot time, and each time that it runs, it requests that it be invoked again 1 second in the future."                                                           
 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: Tacotron2
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: hifigan
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
Removing weight norm...
 > Text: An important event is the scheduling that periodically raises or lowers the CPU priority for each process in the system based on that process’s recent CPU usage (see Section 4.4). The rescheduling calculation is done once per second. The scheduler is started at boot time, and each time that it runs, it requests that it be invoked again 1 second in the future.
 > Text splitted to sentences.
['An important event is the scheduling that periodically raises or lowers the CPU priority for each process in the system based on that process’s recent CPU usage (see Section 4.4).', 'The rescheduling calculation is done once per second.', 'The scheduler is started at boot time, and each time that it runs, it requests that it be invoked again 1 second in the future.']
   > Decoder stopped with `max_decoder_steps` 500
   > Decoder stopped with `max_decoder_steps` 500
 > Processing time: 52.66666388511658
 > Real-time factor: 3.1740607061125763
 > Saving output to tts_output.wav

The audio file is truncated with respect to the text. If I hack the config file at TTS/tts/configs/tacotron_config.py to use a larger max_decoder_steps value, the output does get longer, but I'm not sure how safe this is.

Are there any better solutions? Should I use a different model?
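For context on why 500 steps truncates these sentences, here is a back-of-the-envelope sketch of the audio length that `max_decoder_steps` implies. The sample rate and hop length below are assumptions based on the usual LJSpeech Tacotron2 defaults (22050 Hz, hop 256); with reduction rate r=1 each decoder step emits one mel frame, so 500 steps caps output at roughly 5.8 seconds:

```python
# Assumed LJSpeech Tacotron2 defaults (check your model's config.json):
SAMPLE_RATE = 22050  # Hz
HOP_LENGTH = 256     # audio samples per spectrogram frame
R = 1                # reduction rate: mel frames emitted per decoder step

def max_audio_seconds(max_decoder_steps: int) -> float:
    """Upper bound on synthesized audio length, in seconds."""
    frames = max_decoder_steps * R
    return frames * HOP_LENGTH / SAMPLE_RATE

print(max_audio_seconds(500))    # ~5.8 s: too short for a long sentence
print(max_audio_seconds(10000))  # ~116 s: comfortable headroom
```

Reading the sentence above aloud takes well over 6 seconds, which matches the observed cutoff.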

deliciouslytyped commented 2 years ago

I'm confused because sometimes this works and other times it doesn't. Using the "test" test string, at first I got a synthesis with an extended end and malformed audio; then it worked and I couldn't reproduce the problem anymore. I don't think I changed anything, but I'm not sure.

Now, I accidentally reproduced the bad sample: badoutput.zip (this is a zipped wav file, due to GitHub's restrictions)

Instead of just "test", you can hear something like "test-t-t-t-t-t-t-t....".

All I changed was max_decoder_steps, to 1000.

jeaye commented 2 years ago

I get the same thing. If sentences are past a certain length, they are cut off in the produced wav. Here's a simple example:

❯ tts --text "This sentence, being as long as it is, most unfortunately, will not be fully stated." --out_path test.wav
 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: Tacotron2
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: hifigan
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
Removing weight norm...
 > Text: This sentence, being as long as it is, most unfortunately, will not be fully stated.
 > Text splitted to sentences.
['This sentence, being as long as it is, most unfortunately, will not be fully stated.']
   > Decoder stopped with `max_decoder_steps` 500
 > Processing time: 3.1818737983703613
 > Real-time factor: 0.49914852912682467
 > Saving output to test.wav

In this example, the speaker is cut off before saying "stated".

How can we synthesize arbitrarily long sentences?
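One workaround, until the decoder limit is fixed, is to pre-split long sentences on clause boundaries so each piece fits within the decoder budget, synthesize each piece, and concatenate the audio. This is only a sketch of the splitting step, using the stdlib; `max_chars` is a hypothetical knob you would tune against your model's effective limit:

```python
import re

def chunk_text(text: str, max_chars: int = 80) -> list[str]:
    """Split text on clause boundaries (, ; :) so that each chunk
    stays under max_chars, packing clauses greedily."""
    parts = re.split(r'(?<=[,;:])\s+', text)
    chunks, current = [], ""
    for part in parts:
        candidate = f"{current} {part}".strip()
        if len(candidate) <= max_chars or not current:
            # Keep growing the chunk, or accept a single over-long clause.
            current = candidate
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks

sentence = ("This sentence, being as long as it is, most unfortunately, "
            "will not be fully stated.")
for c in chunk_text(sentence, max_chars=40):
    print(c)
```

Each chunk could then be passed to tts separately and the resulting wavs joined, at the cost of slightly unnatural prosody at the seams.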

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discourse page for further help. https://discourse.mozilla.org/c/tts

jeaye commented 2 years ago

It may be stale, but this issue is not fixed. It's easy to reproduce and a blocker for any serious work with TTS.

ethindp commented 2 years ago

Suffering this issue too. Unsure what to do to resolve it. Will try other models to see what happens, I suppose.

asiletto commented 1 year ago

I have the same problem here: long sentences get truncated.

It seems to be just a configuration issue, as described here: https://github.com/thorstenMueller/Thorsten-Voice/issues/22

Setting "max_decoder_steps": 10000 in the model's config.json solved the problem.
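For anyone scripting this fix, the edit can be applied to the downloaded model's config.json with the stdlib. The path below is an assumption: on Linux, models downloaded by the CLI typically land under ~/.local/share/tts/ in a directory named after the model, and max_decoder_steps is assumed to be a top-level key; verify both against your local install:

```python
import json
from pathlib import Path

# Assumed download location for the LJSpeech Tacotron2-DDC model;
# check where the CLI reported downloading it on your machine.
config_path = (Path.home() / ".local/share/tts"
               / "tts_models--en--ljspeech--tacotron2-DDC" / "config.json")

def raise_decoder_steps(path: Path, steps: int = 10000) -> None:
    """Rewrite config.json with a larger max_decoder_steps,
    leaving every other setting untouched."""
    config = json.loads(path.read_text())
    config["max_decoder_steps"] = steps
    path.write_text(json.dumps(config, indent=4))
```

Note the trade-off: a larger budget lets long sentences finish, but when the stopnet fails to fire (as in the "test-t-t-t-t" sample above), the decoder will babble for longer before hitting the new cap.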