planning: fast open source tts for ichigo

PodsAreAllYouNeed commented 4 days ago

We need to replace the current FishSpeech with a better TTS model.

WIP shortlist of possible candidates:

Amphion (https://github.com/open-mmlab/Amphion) (note: this is a framework)
Tacotron2 (https://github.com/NVIDIA/tacotron2)
Hifi-GAN (https://github.com/jik876/hifi-gan)
MELLE (https://arxiv.org/pdf/2407.08551)
VALLE-2 (https://arxiv.org/pdf/2406.05370)
Voicebox (https://arxiv.org/pdf/2306.15687)
F5-TTS (https://arxiv.org/pdf/2410.06885)

Test sentence: I'm Ichigo, a local AI created by Homebrew Research. I'm here to help answer your questions and make your life easier.

Samples https://drive.google.com/drive/folders/1FbR5H7rqirHDgxbjxO8Zwhxsj5y4t_mq?usp=sharing

Name	License	Code	Paper	Comments
Tacotron2	BSD 3-Clause License	https://github.com/NVIDIA/tacotron2	https://arxiv.org/pdf/1712.05884	https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/nvidia_deeplearningexamples_tacotron2.ipynb This model runs faster than real time and generates mel-spectrograms, which can be very fast, but it sounds terrible compared to recent models. Probably not usable.
Hifi-GAN	MIT License	https://github.com/jik876/hifi-gan	https://arxiv.org/abs/2010.05646	https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/nvidia_deeplearningexamples_hifigan.ipynb#scrollTo=b3b54df5 Old, but it has been used in many papers. Sounds better than Tacotron2, but not by much.
FastSpeech2
VITS
VALLE
NaturalSpeech2
Jets
MELLE
VALLE-2				Tried this using Amphion. It is not able to pronounce "Ichigo" and "AI" properly. Probably something went wrong with the phoneme conversion. E2/F5-TTS is a bit better at this.
Voicebox
E2/F5-TTS	MIT License	https://github.com/SWivid/F5-TTS	https://arxiv.org/abs/2410.06885	https://huggingface.co/spaces/mrfakename/E2-F5-TTS Generation seems pretty good, but it is not clear whether it will be fast enough. Needs a transcript of the reference audio. F5-TTS needs speed set to 0.8 for better generation.

tikikun commented 2 days ago

Why does the F5-TTS sample work? Everything else seems pretty bad.

hahuyhoang411 commented 1 day ago

With F5-TTS we can change Ichigo's system prompt a bit to make it sound more natural.

PodsAreAllYouNeed commented 1 day ago

Tested on TTS Arena and added to Drive:

Commercial: ElevenLabs, FishSpeech v1.4, PlayHT2.0, PlayHT3.0mini, XTTSv2

Non-commercial: GPT-SoVITS (MIT License), MeloTTS (MIT License; multi-lingual, multi-accent), OpenVoicev2 (MIT License), Parler-TTS and Parler-TTS Large (Apache-2.0), StyleTTS2 (MIT License)

Unknown license: VoiceCraftV2

PodsAreAllYouNeed commented 1 day ago

After testing these models, it seems F5-TTS is the only open-source TTS that both pronounces "Ichigo" correctly and reads out the acronym "AI" correctly. The commercial ones have no problem with this, of course. The next question is whether F5-TTS inference is going to be fast enough. Will update after some testing.
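
One way to put a number on "fast enough" is the real-time factor (RTF): synthesis time divided by the duration of the generated audio, where a value below 1.0 means faster than real time. Below is a minimal sketch, not a finished benchmark. The `Synthesizer` interface, `real_time_factor`, and `dummy_synthesize` are hypothetical names made up for illustration; a real test would wrap F5-TTS (or any other candidate) in a function with the same shape.

```python
import time
from typing import Callable, Sequence, Tuple

# Hypothetical interface (an assumption, not any library's API):
# a TTS wrapper that maps text to (samples, sample_rate).
Synthesizer = Callable[[str], Tuple[Sequence[float], int]]

TEST_SENTENCE = (
    "I'm Ichigo, a local AI created by Homebrew Research. "
    "I'm here to help answer your questions and make your life easier."
)


def real_time_factor(
    synthesize: Synthesizer,
    text: str,
    runs: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Total synthesis time divided by total audio duration over `runs` calls.

    A value below 1.0 means audio is generated faster than real time.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    total_elapsed = 0.0
    total_audio = 0.0
    for _ in range(runs):
        start = clock()
        samples, sample_rate = synthesize(text)
        total_elapsed += clock() - start
        total_audio += len(samples) / sample_rate
    if total_audio == 0:
        raise ValueError("synthesizer returned no audio")
    return total_elapsed / total_audio


def dummy_synthesize(text: str) -> Tuple[Sequence[float], int]:
    """Placeholder that returns silence; swap in a real model wrapper."""
    sample_rate = 24_000
    seconds = 0.06 * len(text)  # made-up speaking rate, for illustration only
    return [0.0] * int(sample_rate * seconds), sample_rate


if __name__ == "__main__":
    rtf = real_time_factor(dummy_synthesize, TEST_SENTENCE)
    print(f"RTF: {rtf:.3f}")
```

Running the same function over each candidate with the test sentence would give comparable numbers. For a fair comparison, model loading should happen before the timed call, and the first (warm-up) run is probably best discarded.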

homebrewltd / ichigo

planning: fast open source tts for ichigo #94