Coqui/Mozilla TTS support #217

Open Hattshire opened 3 years ago

Hattshire commented 3 years ago

It would be great if custom voices could use a Coqui TTS server to serve natural-sounding voices, either locally or remotely.

ken107 commented 3 years ago

Update: I have tested Coqui TTS. The voice quality is impressive, and quite stable. Using the "glow" model, I can synthesize a 500-character paragraph in about 5.6 seconds. Oddly, it takes about the same time whether I use Torch (cpu) or Torch (cu102) on my GTX 1070. 5.6s unfortunately isn't scalable for us.

The python script also adds an additional 10s of startup time, which could be eliminated by running it as a service, but that'd take some work. Finally, the models are not handling acronyms properly. They're using the "gruut" phonemizer library, which I guess is pronouncing the acronyms rather than spelling them out. A workaround is to preprocess the text and "spell out" the acronyms, e.g. TTS would become "tee tee ess".
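For illustration, the acronym workaround could look roughly like the sketch below. This is only a sketch under stated assumptions: the letter-name table and the rule "any standalone run of two or more capitals is an acronym" are my own guesses, not anything gruut or Coqui TTS provides, and the function name `spell_out_acronyms` is hypothetical.

```python
import re

# Hypothetical letter-name table; the spellings are illustrative guesses,
# not taken from gruut or Coqui TTS.
LETTER_NAMES = {
    "A": "ay", "B": "bee", "C": "see", "D": "dee", "E": "ee", "F": "eff",
    "G": "jee", "H": "aitch", "I": "eye", "J": "jay", "K": "kay", "L": "el",
    "M": "em", "N": "en", "O": "oh", "P": "pee", "Q": "cue", "R": "ar",
    "S": "ess", "T": "tee", "U": "you", "V": "vee", "W": "double-you",
    "X": "ex", "Y": "why", "Z": "zee",
}

# Assumption: any standalone run of two or more ASCII capitals is an acronym.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")


def spell_out_acronyms(text: str) -> str:
    """Replace each all-caps token with its letters spelled out."""
    def _spell(match: re.Match) -> str:
        return " ".join(LETTER_NAMES[ch] for ch in match.group(0))

    return ACRONYM_RE.sub(_spell, text)


print(spell_out_acronyms("TTS would become spelled out"))
```

The text would be passed through this function before being sent to the synthesizer. Note that the rule also spells out acronyms normally read as words (e.g. NASA) and any word typed in all caps, so an exception list would probably be needed in practice.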

kfatehi commented 1 year ago

Hi @ken107

I am using Coqui TTS with a Persian/Farsi model, running on a hacked-up version of my Riva TTS Proxy server (nothing to do with Riva itself, but it reuses the transcoding and is compatible with the read-aloud integration we've added).

The model works quite well and outperforms the SAPI 5 voice I had been using (Dariush Premium by Harpo Software), which throws a COM Error on a lot of new slang that I guess it had never been trained on directly. The neural-network-based stuff pronounces rare words just fine. Interestingly, it rarely pronounces the same word in exactly the same way twice.

I was thinking maybe I should refactor that server project to not be Riva-specific and be able to house Coqui models as well.

My current architecture is a hacked version of the Riva TTS Proxy that exposes a Persian voice to ReadAloud. The proxy points to another server (let's call this the Coqui model worker), which actually runs the Coqui voice. Coqui streams its WAV, which the TTS Proxy transcodes as usual into the format ReadAloud expects.

One idea is to make the proxy more generic, so that it just takes care of network communication with new backends, transcoding their output, managing the new voices, and making them easy to consume by ReadAloud-like frontend clients in their preferred format.

That way, users can deploy model workers (e.g. persian-tts-server, the riva stack, sapi 5 exposers, etc) and then enable them in the TTS proxy... then ReadAloud can just enumerate the voices exposed by the proxy.
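To make the idea a bit more concrete, here is a minimal in-memory sketch of the registry part: workers register their voices, the user enables a worker, and the frontend enumerates only the voices of enabled workers. Everything here is hypothetical (the class names `Voice`, `Worker`, and `TTSProxy`, the `worker/voice` ID format, and the example URLs); it leaves out the network and transcoding layers entirely and is not how the existing proxy is written.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Voice:
    name: str
    lang: str


@dataclass
class Worker:
    """A model worker (e.g. a Coqui or SAPI 5 backend) and the voices it serves."""
    name: str
    url: str  # where the proxy would forward synthesis requests
    voices: List[Voice]
    enabled: bool = False


class TTSProxy:
    """Tracks workers and exposes the voices of the enabled ones."""

    def __init__(self) -> None:
        self._workers: Dict[str, Worker] = {}

    def register(self, worker: Worker) -> None:
        self._workers[worker.name] = worker

    def enable(self, worker_name: str) -> None:
        self._workers[worker_name].enabled = True

    def list_voices(self) -> List[str]:
        """Voice IDs a frontend could enumerate, formatted as 'worker/voice'."""
        return [
            f"{w.name}/{v.name}"
            for w in self._workers.values() if w.enabled
            for v in w.voices
        ]

    def resolve(self, voice_id: str) -> Worker:
        """Find the enabled worker that should handle a given voice ID."""
        worker_name, voice_name = voice_id.split("/", 1)
        worker = self._workers[worker_name]
        if not worker.enabled or all(v.name != voice_name for v in worker.voices):
            raise KeyError(voice_id)
        return worker


proxy = TTSProxy()
# Hypothetical worker names and URLs, for illustration only.
proxy.register(Worker("persian-tts-server", "http://localhost:5002",
                      [Voice("persian-female", "fa")]))
proxy.register(Worker("riva", "http://localhost:5003",
                      [Voice("english-us", "en-US")]))
proxy.enable("persian-tts-server")
print(proxy.list_voices())
```

The point of routing by voice ID is that the frontend never needs to know which backend serves a voice; it only sees the list the proxy exposes.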

Just an idea, let me know what you think. Right now I have a mess of proofs of concept, so I'll wait for feedback to know in which direction to clean up and make things more shareable, but that's the idea I have in mind so far.

kfatehi commented 1 year ago

I took a look at

Looks like the voices are exposed via SAPI 5, which I wrote a backend for in the proxy (to access that Dariush voice). At first I found the pyttsx3 library, but it did not implement in-memory buffering/streaming of the WAV.

I don't know why RHVoice does not talk about a Mac OS standard voice, but it does describe a Linux one, so I could explore adding that too.

Doesn't ReadAloud already utilize SAPI 5? How else does it have access to the offline Microsoft voices? If this is true, then RHVoice should work out of the box. I suspect there is something missing here, though, because I did not see Dariush (a SAPI 5 voice) in the list, which is what caused me to learn SAPI 5 in detail.

kfatehi commented 1 year ago

> Finally, the models are not handling acronyms properly. They're using the "gruut" phonemizer library, which I guess is pronouncing the acronyms rather than spelling them out. A workaround is to preprocess the text and "spell out" the acronyms, e.g. TTS would become "tee tee ess".

I noticed that Nvidia Riva also has this problem. I haven't tried to address it, but I like your idea. I don't think it works in all situations, but it works in enough of them to be worth implementing.

kfatehi commented 1 year ago

Here is some performance information for the Persian model I am using (persian-tts-female-vits). Note that I am using an Nvidia 4090.

The model takes about 6 seconds to load on initial startup before it's ready to handle synthesis requests.

 > Text splitted to sentences.
['صفحه 5 باخليج هميشگى فارس گزارشگزارش صفحه 7 وقتى متغيرهاى اقتصادى اجاره مسكن را تعيين مى كنند اقتصادياقتصادي تومان 5000 دوشنبه 18 ارديبهشت 1402 - 17 شوال 1444 - 8 مه 2023 - سال نودوهفتم - شماره 28384 - 16 صفحه به همراه 8 صفحه ضميمه - تك شماره * رئيسي در آيين افتتاح نمايشگاه توانمنديهاي صادراتي:']
 > Processing time: 0.4252307415008545
 > Real-time factor: 0.01276483644362068
INFO:werkzeug: - - [07/May/2023 23:08:51] "POST /synthesize HTTP/1.1" 200 -
 > Text splitted to sentences.
['اولين گام در توسعه روابط اقتصادي، معرفي ظرفيتها و توانمنديهاي داخلي و شناخت ظرفيت كشورهاي ديگر است * توليدكنندگان كشورمان در عرصه صنايع فولادي، پتروشيمي و صنايع ديگر، دستاوردهاي قابل توجهي داشتهاند * دانشـمندان ايرانـي امـروز بسـياري از علـوم از جملـه نظامـي و هستهاي را بوميسازي كردهاند * ركورد صادرات در دولت سيزدهم با عبور از 50 ميليارد دلار شكست * ايران به هيچوجه رشـد و توليد كشـور را به مذاكرات سياسـي گره نخواهد زد * مذاكرات سياسي عزتمندانه پيش خواهد رفت * سرپرسـت وزارت صنعـت: توسـعه تجـارت بـا كشـورهاي آسـياي جنوب شـرقي، آفريقا و كشـورهاي همسـايه در اولويت قرار دارد صفحه2صفحه16 سلاجقه:']
 > Processing time: 0.3436312675476074
 > Real-time factor: 0.006357812111571191
INFO:werkzeug: - - [07/May/2023 23:08:52] "POST /synthesize HTTP/1.1" 200 -
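To put those numbers in perspective, here is a small sketch that derives the audio duration and the faster-than-real-time speedup from the two log entries above. It assumes the logged real-time factor is processing time divided by the duration of the generated audio, which is my reading of the log; the helper names are made up for this example.

```python
def audio_seconds(processing_time: float, real_time_factor: float) -> float:
    """Audio duration implied by the log, assuming RTF = processing time / audio duration."""
    return processing_time / real_time_factor


def speedup(real_time_factor: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / real_time_factor


# (processing time, real-time factor) pairs copied from the log above.
LOG_ENTRIES = [
    (0.4252307415008545, 0.01276483644362068),
    (0.3436312675476074, 0.006357812111571191),
]

for processing_time, rtf in LOG_ENTRIES:
    print(
        f"{processing_time:.3f}s compute -> "
        f"{audio_seconds(processing_time, rtf):.1f}s audio, "
        f"{speedup(rtf):.0f}x real time"
    )
```

Under that assumption, the two requests produced roughly 33 and 54 seconds of audio in well under half a second of compute each.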