thorstenMueller / Thorsten-Voice

Thorsten-Voice: A free-to-use, offline, high-quality German TTS voice should be available for every project without any licensing hassle.
http://www.thorsten-voice.de
Creative Commons Zero v1.0 Universal

Improve Audio Generation Speed #30

Closed r4nc0r closed 2 years ago

r4nc0r commented 2 years ago

Hi, first of all thank you very much for your contribution!

I'm trying to build a real-time voice assistant, for which I use different tools for STT, NLP, and TTS. I would love to use your voice for this, but on-the-fly audio generation with your Tacotron2 model is a bit slow.

I found this comparison: https://github.com/coqui-ai/TTS/discussions/522
Is there any way to speed up the audio generation to values similar to those of the English models?

thorstenMueller commented 2 years ago

@domcross and I are working on new/better models using the HiFi-GAN vocoder; samples are available on the Thorsten-Voice project website. These models might be faster than the currently available one. But maybe you should also check out the work by @synesthesiam on Larynx. My voice is available there too, and it's really fast.

Did you test with the "WaveGrad" or the "Fullband-MelGAN" vocoder? (Fullband-MelGAN is way faster.)
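
For reference, the vocoder can be selected explicitly when starting the server. A minimal sketch, assuming the vocoder_models/de/thorsten/fullband-melgan identifier from Coqui's model list:

    # start the demo server with the faster Fullband-MelGAN vocoder
    tts-server --model_name tts_models/de/thorsten/tacotron2-DCA \
               --vocoder_name vocoder_models/de/thorsten/fullband-melgan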

r4nc0r commented 2 years ago

Thanks for your quick reply and for pointing me in the right direction!

I just used your model with the parameters specified in the readme: tts-server --model_name tts_models/de/thorsten/tacotron2-DCA

thorstenMueller commented 2 years ago

I tried the following: pip3 install tts==0.5.0 and then ran tts-server --model_name tts_models/de/thorsten/tacotron2-DCA. I got an RTF of around 0.6-1 on my notebook CPU, which I think isn't too bad. What RTF do you get?
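
For anyone reproducing this, the two commands spelled out (the PyPI package is named TTS):

    # install Coqui TTS 0.5.0 and start the demo server with the German Tacotron2 model
    pip3 install TTS==0.5.0
    tts-server --model_name tts_models/de/thorsten/tacotron2-DCA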

Just in case you're interested: https://www.thorsten-voice.de/2022/03/20/vergleich-thorsten-aktuell-mit-dem-neuen-modell/ (a comparison of the current Thorsten model with the new one, in German)

r4nc0r commented 2 years ago

I just did that with the addition of --show_details, and my RTF is about 0.6 (i.e., generating the audio takes roughly 60% of its playback time):

 > Processing time: 3.101564407348633
 > Real-time factor: 0.5756691513639508

I use a 12-core Ryzen 3000 processor, but a processing time of 3 s is extremely high for my use case of generating just-in-time responses for my voice assistant. I built a workaround which caches most WAV files, but that doesn't work when I generate responses with variables in the text.

Also, I would love to use your new model. Is there a way to use it?

thorstenMueller commented 2 years ago

The new model is not released yet. I'll keep the community updated on the release date on Twitter and my YouTube channel. I'd recommend taking a look at Larynx, as it's designed for low compute power (like a Raspberry Pi), and my German voice is available there too.

synesthesiam commented 2 years ago

@r4nc0r Keep watch for the release of Mimic 3 (samples), which should be this month. You should get an 8-10x speedup with it; I typically get an RTF of 0.03, but I'm also on a Ryzen 5950X.

thorstenMueller commented 2 years ago

> Also, I would love to use your new model. Is there a way to use it?

Hi @r4nc0r, you can download the model and config from the @coqui-ai 0.7.0 prerelease here: https://github.com/coqui-ai/TTS/releases
An easy pip-based installation will follow once the final 0.7.0 is released.
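
A minimal sketch of running a manually downloaded checkpoint with the Coqui tts CLI (model_file.pth and config.json stand in for whatever the release assets are actually named):

    # synthesize directly from a locally downloaded model and config
    tts --text "Hallo, wie geht es dir?" \
        --model_path model_file.pth \
        --config_path config.json \
        --out_path hallo.wav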

> Keep watch for the release of Mimic 3

You can play around with the beta of Mimic 3 with my German voice (and some more German voices), as mentioned by @synesthesiam: https://mycroft.ai/blog/mimic-3-preview/

thorstenMueller commented 2 years ago

As Mimic 3 has now been released, you can simply use it. You can watch this video on how to set it up and use it, and/or check the official documentation.
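
A minimal sketch of installing and running Mimic 3 from the command line, assuming the mycroft-mimic3-tts PyPI package and the de_DE/thorsten_low voice key (check the official docs for the exact names):

    # install Mimic 3 and synthesize a WAV file with the German Thorsten voice
    pip3 install mycroft-mimic3-tts
    mimic3 --voice de_DE/thorsten_low "Hallo, wie geht es dir?" > hallo.wav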

If you want to use Coqui TTS (a little bit slower, but better quality), you can do this as well:
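
For example (assuming the new model is published under a name like tts_models/de/thorsten/vits once 0.7.0 is out; adjust to the actual name from the model list):

    # install the 0.7.0 release and start the server with the new model
    pip3 install TTS==0.7.0
    tts-server --model_name tts_models/de/thorsten/vits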

I'll close this issue for now, but feel free to reopen it if you have further questions.