djdookie opened 4 months ago
Not yet, but I want to have the best options available so will take a look at these when I get some more time.
While we wait, here are a couple of suggestions.
In my opinion, Tortoise TTS currently offers the best balance between quality and speed. It can achieve up to 7x real-time generation, surpassing XTTS, which is capped at around 3x. In this video demonstration, the model generated a 20-second audio clip in just 3 seconds with optimizations enabled, and performance seems to improve further with longer text inputs. In terms of audio quality, Tortoise TTS is on par with XTTS. Additionally, the Tortoise repository is actively maintained and regularly updated, whereas Coqui has already shut down.
Another promising option is Parler TTS, which is backed by Hugging Face and has further improvements planned. One major advantage of Parler TTS is its support for batching, allowing it to handle high traffic much more efficiently than queuing requests and generating sample by sample.
An older version of this project had Parler TTS support (the original model), but I removed it because the voices seemed random, which doesn't fit this project. The new Parler version with stable voice identities is back on my radar, but I haven't tested it yet for quality or speed.
Re Tortoise: it's news to me that it's faster; it has always been slower in my experience. I'll give it another look.
The OpenAI speech API doesn't support batching according to the API reference, so I don't plan to include batch support.
For cases outside API compatibility, especially batching, I recommend you implement inference with the model directly in your code and not via a network API. It would be much more efficient.
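To make the efficiency argument concrete, here is a minimal sketch contrasting per-request calls with one in-process batched call. `synthesize` and `synthesize_batch` are hypothetical placeholders standing in for a real model's generate functions (not any actual Parler or Tortoise API); the `time.sleep` simulates the fixed per-call overhead you pay once per network request but only once per batch when calling the model directly.

```python
import time

def synthesize(text):
    # Hypothetical single-sample call: pays the fixed overhead every time.
    time.sleep(0.01)  # stand-in for per-request overhead (network, setup)
    return f"audio<{text}>"

def synthesize_batch(texts):
    # Hypothetical batched call: pays the fixed overhead once for all samples.
    time.sleep(0.01)
    return [f"audio<{t}>" for t in texts]

texts = ["one", "two", "three", "four"]

# Per-sample over an API: N calls, N times the overhead.
serial = [synthesize(t) for t in texts]

# Direct batched inference in-process: one call, one overhead.
batched = synthesize_batch(texts)

assert serial == batched  # same results, far less overhead per sample
```

The savings grow with concurrency: four API round-trips become one batched forward pass.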
The OpenAI speech API doesn't support batching according to the API reference, so I don't plan to include batch support.
I think there's been a misunderstanding. When I mentioned batching, I was referring to the server intelligently switching to batching mode when it receives concurrent requests. This allows it to process those requests in parallel. From reviewing your code, I can see that there is parallelism implemented, but it isn’t fully optimized using Parler’s native code, which offers a significant performance boost in such cases.
I think I get you now: the goal is to implement continuous batching for processing parallel requests, not batch processing of a single batched request.
I hadn't considered that yet, but it is a much better solution to parallel processing than the current setup. Thanks for the suggestion.
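For reference, the idea can be sketched with a small asyncio micro-batcher: concurrent requests arriving within a short window are coalesced into one batched model call, and each caller still gets back only its own result. `synthesize_batch` is a hypothetical placeholder for a model's native batched generation, not a real Parler TTS API; the window and batch size are illustrative.

```python
import asyncio

BATCH_WINDOW = 0.02  # seconds to wait for more requests to join a batch
MAX_BATCH = 8

def synthesize_batch(texts):
    # Placeholder: a real implementation would run one batched forward pass.
    return [f"audio<{t}>" for t in texts]

class Batcher:
    def __init__(self):
        self.queue = asyncio.Queue()

    async def worker(self):
        while True:
            # Block until at least one request arrives.
            item = await self.queue.get()
            batch = [item]
            # Coalesce anything else arriving within the window, up to MAX_BATCH.
            loop = asyncio.get_running_loop()
            deadline = loop.time() + BATCH_WINDOW
            while len(batch) < MAX_BATCH:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # One batched call serves every request in the batch.
            results = synthesize_batch([text for text, _ in batch])
            for (_, fut), audio in zip(batch, results):
                fut.set_result(audio)

    async def submit(self, text):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((text, fut))
        return await fut

async def main():
    b = Batcher()
    worker = asyncio.create_task(b.worker())
    outs = await asyncio.gather(*(b.submit(t) for t in ["hi", "there", "world"]))
    worker.cancel()
    print(outs)  # results come back in request order

asyncio.run(main())
```

Requests that arrive alone simply form a batch of one after the window expires, so latency under light load stays close to the single-request path.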
Is there a way to include and serve MeloTTS and/or OpenVoice? They're state-of-the-art TTS (and voice cloning) models and pretty fast, even on CPU only.
https://github.com/myshell-ai/MeloTTS https://github.com/myshell-ai/OpenVoice