Open secretsauceai opened 3 years ago
By Mimic, I understand you mean Mimic2, right? For TTS I'm using Pico. I know it is not AI-based, but it is fast and sounds good (at least in Spanish).
I haven't tried anything locally beyond the original Mimic. However, I would like a much better solution in the future. The voice is pretty 'robotic'. Mimic2 is based on Tacotron, so until it is possible to run that in TFLite or something similar, it won't even get close to running in 'real time' on a raspi4.
How do you feel about the performance of Pico?
Lots of further research is needed; I need to look into this in much more depth myself.
For me, the solution for the time being is to use online TTS services, with Pico and optionally eSpeak as fallbacks.
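A minimal sketch of that fallback idea, in case it helps the discussion. It assumes the `pico2wave` and `espeak` command-line tools are installed; `speak_with_fallback` and the `(name, backend)` list are hypothetical names made up for illustration, and the online backend is left out because it depends on whichever service you pick.

```python
import subprocess
from typing import Callable, Sequence, Tuple

# A backend takes (text, wav_path) and raises an exception on failure.
Backend = Callable[[str, str], None]


def pico(text: str, wav_path: str, lang: str = "es-ES") -> None:
    # Assumes the pico2wave binary is on PATH.
    subprocess.run(["pico2wave", "-l", lang, "-w", wav_path, text], check=True)


def espeak(text: str, wav_path: str, voice: str = "es") -> None:
    # Assumes the espeak binary is on PATH.
    subprocess.run(["espeak", "-v", voice, "-w", wav_path, text], check=True)


def speak_with_fallback(
    text: str, wav_path: str, backends: Sequence[Tuple[str, Backend]]
) -> str:
    """Try each backend in order; return the name of the first that works."""
    errors = []
    for name, backend in backends:
        try:
            backend(text, wav_path)
            return name
        except Exception as exc:  # missing binary, network error, bad exit code
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS backends failed: " + "; ".join(errors))
```

Usage would look something like `speak_with_fallback("hola", "out.wav", [("online", my_online_tts), ("pico", pico), ("espeak", espeak)])`, where `my_online_tts` is whatever wrapper you write around your online service.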
Regarding Mimic/Tacotron: Mozilla improved DeepSpeech a ton and made it usable on a Raspi4; I hope Tacotron gets the same treatment.
Also, there are several variants of the TTS components. Mozilla TTS (which implements Tacotron too) lets you play with them.
It would be interesting to compile a data set of responses and measure how long it takes to generate the TTS as a benchmark. A subjective 'how robotic is the voice' rating wouldn't be bad either.
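A rough sketch of what such a timing benchmark could look like. It assumes each engine is wrapped in a function `synthesize(text)` that returns the duration of the generated audio in seconds; that interface, and the names `benchmark_tts`, `summarize`, and `wav_duration`, are my own invention for illustration. The metric is the real-time factor (synthesis time divided by audio length), so a value above 1 means slower than real time.

```python
import statistics
import time
import wave
from typing import Callable, Dict, Iterable, List


def wav_duration(path: str) -> float:
    """Length of a WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def benchmark_tts(
    synthesize: Callable[[str], float],
    sentences: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> List[Dict]:
    """Time synthesize() on each sentence and compute the real-time factor."""
    rows = []
    for text in sentences:
        start = clock()
        audio_seconds = synthesize(text)  # hypothetical: returns audio length
        elapsed = clock() - start
        if audio_seconds <= 0:
            raise ValueError(f"no audio produced for {text!r}")
        rows.append({
            "text": text,
            "synth_seconds": elapsed,
            "audio_seconds": audio_seconds,
            "rtf": elapsed / audio_seconds,  # > 1 means slower than real time
        })
    return rows


def summarize(rows: List[Dict]) -> Dict[str, float]:
    """Aggregate the per-sentence real-time factors."""
    rtfs = [r["rtf"] for r in rows]
    return {"mean_rtf": statistics.mean(rtfs), "max_rtf": max(rtfs)}
```

A wrapper for a command-line engine could write to a temporary WAV file and return `wav_duration(path)`. The subjective 'robotic' rating would have to be collected separately from listeners; this only covers speed.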
I am quite curious about performance benchmarks in TTS.
According to the article linked above about Mozilla TTS on a raspi, it runs 6 times slower than 'real time' with that configuration.
I have used TTS on a raspi 3b+. festvox/flite works well and is close enough to real time to be useful for a screen reader, so it will work as an Assistant voice. festvox/festival is the interpreted version and does not run in real time on the 3b+, but you may have better luck on the 4. The default voices for both festival and flite are not as good as some of the other voices you can download. My blind friend suggested RHVoice, which I haven't tested personally on a raspi, but it works on Android so I would hope it works on a raspi too.
We know Mimic can run on a raspi4 in 'real time'. We also know that Tacotron(2) will probably never run in real time on a raspi4 (or perhaps it will?), so what does that leave us with?
Has anyone tried the Silero TTS models?