matatonic / openedai-speech

An OpenAI API compatible text to speech server using Coqui AI's xtts_v2 and/or piper tts as the backend.
GNU Affero General Public License v3.0

Include MeloTTS or OpenVoice #16

Open djdookie opened 4 months ago

djdookie commented 4 months ago

Is there a way to include and serve MeloTTS and/or OpenVoice? They're state-of-the-art TTS (and voice cloning) models and fast, even running on CPU only.

https://github.com/myshell-ai/MeloTTS
https://github.com/myshell-ai/OpenVoice
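
For reference, the MeloTTS README shows a small Python API, so a backend wrapper could be fairly thin. A minimal sketch following that README (module path and speaker IDs as documented there; they may change between releases):

```python
# Minimal sketch following the MeloTTS README; the melo.api module,
# speaker IDs, and defaults are as documented there and may change
# between releases.
from melo.api import TTS

# 'auto' selects CUDA/MPS/CPU; MeloTTS is usable in real time on CPU.
model = TTS(language="EN", device="auto")
speaker_ids = model.hps.data.spk2id  # e.g. 'EN-US', 'EN-BR', 'EN-AU', ...

model.tts_to_file(
    "Hello from a MeloTTS backend.",
    speaker_ids["EN-US"],
    "en-us.wav",
    speed=1.0,
)
```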

matatonic commented 4 months ago

Not yet, but I want to have the best options available, so I'll take a look at these when I get some more time.

ground-creative commented 4 months ago

While we wait.

https://github.com/ground-creative/openvoice-api-python

bi1101 commented 1 month ago

In my opinion, Tortoise TTS currently offers the best balance between quality and speed. It can achieve up to 7x real-time generation, surpassing xtts, which is capped at 3x. In this video demonstration, the model generated a 20-second audio clip in just 3 seconds with optimization. It seems that performance improves even further with longer text inputs. In terms of audio quality, Tortoise TTS is on par with xtts. Additionally, the Tortoise repository is actively maintained and regularly updated, whereas Coqui has already shut down.
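
For anyone comparing, here's a minimal sketch following the tortoise-tts README; the preset argument is what controls the quality/speed trade-off mentioned above:

```python
# Minimal sketch following the neonbjb/tortoise-tts README; presets
# trade quality for the speed figures mentioned above.
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()  # downloads model weights on first use
voice_samples, conditioning_latents = load_voice("tom")  # a bundled voice

gen = tts.tts_with_preset(
    "Tortoise trades quality for speed through presets.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",  # 'ultra_fast' / 'fast' / 'standard' / 'high_quality'
)
torchaudio.save("tortoise_out.wav", gen.squeeze(0).cpu(), 24000)
```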

Another promising option is Parler TTS, which is backed by Hugging Face and has improvements planned. One major advantage of Parler TTS is its support for batching, which lets it handle high traffic more efficiently than queuing requests and generating them one sample at a time.
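
For reference, a minimal sketch of basic Parler generation based on the parler-tts README (checkpoint name and description-conditioning API as documented there):

```python
# Basic single-request generation following the parler-tts README;
# the mini v1 checkpoint name and the description-conditioning API
# are as documented there.
import torch
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler-tts-mini-v1"
).to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

prompt = "One batched forward pass can serve several requests."
description = "A calm female voice with clear articulation and little background noise."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# The voice is conditioned on the text description; the prompt is spoken.
audio = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids)
sf.write("parler_out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
```

Since this is a standard `generate` call, a server could pad concurrent prompts and run them through one batched call instead of one at a time.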

matatonic commented 1 month ago

An older version had Parler TTS support (the original release), but I removed it because the voices seemed random from one generation to the next, which doesn't fit this project. The new Parler version with stable voice identities is back on my radar, but I haven't tested it yet for quality or speed.

Re: Tortoise, it being faster is news to me; it has always been slower. I'll give it another look.

matatonic commented 1 month ago

The openai speech API doesn't support batching according to the API reference, so I don't plan to include batch support.

For use cases outside API compatibility, especially batching, I recommend implementing inference with the model directly in your code rather than via a network API. It would be much more efficient; see the sketch below.
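
To illustrate the difference (the endpoint shape follows the OpenAI speech API; `load_model` and `synthesize_batch` are hypothetical stand-ins for whichever backend you embed):

```python
# Illustration only: N separate API calls vs. one direct batched call.
import requests

def load_model():
    """Hypothetical: load the TTS model once into memory."""
    ...

def synthesize_batch(model, texts):
    """Hypothetical: one batched forward pass returning audio per text."""
    return [b"" for _ in texts]  # placeholder audio bytes

texts = ["First clip.", "Second clip.", "Third clip."]

# Via the network API: one HTTP round-trip and one model invocation per
# text, paying serialization and transport overhead each time.
for i, text in enumerate(texts):
    r = requests.post(
        "http://localhost:8000/v1/audio/speech",  # OpenAI-style endpoint
        json={"model": "tts-1", "input": text, "voice": "alloy"},
    )
    with open(f"api_{i}.mp3", "wb") as f:
        f.write(r.content)

# Direct inference: load the model once, then run all texts in a single
# batched call.
model = load_model()
clips = synthesize_batch(model, texts)
```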

bi1101 commented 1 month ago

> The openai speech API doesn't support batching according to the API reference, so I don't plan to include batch support.

I think there's been a misunderstanding. When I mentioned batching, I was referring to the server intelligently switching to batched generation when it receives concurrent requests, so it can process those requests in parallel. From reviewing your code, I can see that some parallelism is implemented, but it isn't fully optimized; Parler's native batched generation offers a significant performance boost in such cases.

matatonic commented 1 month ago

> > The openai speech API doesn't support batching according to the API reference, so I don't plan to include batch support.
>
> I think there's been a misunderstanding. When I mentioned batching, I was referring to the server intelligently switching to batched generation when it receives concurrent requests, so it can process those requests in parallel. From reviewing your code, I can see that some parallelism is implemented, but it isn't fully optimized; Parler's native batched generation offers a significant performance boost in such cases.

I think I get you now: the idea is to implement continuous batching to process parallel requests, not batch processing of a single batched request.

I hadn't considered that yet, but it is a much better solution for parallel processing than the current setup. Thanks for the suggestion.
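
A rough sketch of what that could look like on the server side (request-level dynamic batching; `synthesize_batch` is a hypothetical stand-in for the model's batched call):

```python
# Sketch of request-level dynamic batching: collect requests that
# arrive within a short window, run them as one batch, and resolve
# each caller's future.
import asyncio

MAX_BATCH = 8    # cap on requests per batch
WINDOW_S = 0.02  # how long to wait for more requests to join a batch

queue: asyncio.Queue = asyncio.Queue()

def synthesize_batch(texts):
    """Hypothetical: one batched model call returning audio per text."""
    return [b"" for _ in texts]  # placeholder audio bytes

async def handle_request(text: str):
    """Called per incoming request; awaits the batched result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut

async def batch_worker():
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]  # block until the first item arrives
        deadline = loop.time() + WINDOW_S
        while len(batch) < MAX_BATCH:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        texts = [text for text, _ in batch]
        # Run generation off the event loop so new requests keep queuing.
        clips = await asyncio.to_thread(synthesize_batch, texts)
        for (_, fut), clip in zip(batch, clips):
            fut.set_result(clip)
```

The window and batch-size settings are a latency/throughput trade-off: a longer window batches more concurrent requests but delays the first one.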