homebrewltd / ichigo

Llama3.1 learns to Listen
Apache License 2.0

planning: fast open source tts for ichigo #94

Open PodsAreAllYouNeed opened 4 days ago

PodsAreAllYouNeed commented 4 days ago

We need to replace the current FishSpeech with a better TTS model.

WIP Shortlist of Possible candidates:

Test sentence: I'm Ichigo, a local AI created by Homebrew Research. I'm here to help answer your questions and make your life easier.

Samples https://drive.google.com/drive/folders/1FbR5H7rqirHDgxbjxO8Zwhxsj5y4t_mq?usp=sharing

| Name | License | Code | Paper | Demo | Comments |
|---|---|---|---|---|---|
| Tacotron2 | BSD 3-Clause License | https://github.com/NVIDIA/tacotron2 | https://arxiv.org/pdf/1712.05884 | https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/nvidia_deeplearningexamples_tacotron2.ipynb | This model is faster than real time and uses mel-spectrograms, which can be very fast, but it sounds really terrible compared to recent models. Probably not usable. |
| Hifi-GAN | MIT License | https://github.com/jik876/hifi-gan | https://arxiv.org/abs/2010.05646 | https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/nvidia_deeplearningexamples_hifigan.ipynb#scrollTo=b3b54df5 | Old, but it has been used by many papers. Sounds better than Tacotron2, but not by much. |
| FastSpeech2 | | | | | |
| VITS | | | | | |
| VALLE | | | | | |
| NaturalSpeech2 | | | | | |
| Jets | | | | | |
| MELLE | | | | | |
| VALLE-2 | | | | | Tried this using Amphion. It is not able to pronounce "Ichigo" and "AI" properly. Probably something went wrong with the phoneme conversion. E2/F5-TTS is a bit better at this. |
| Voicebox | | | | | |
| E2/F5-TTS | MIT License | https://github.com/SWivid/F5-TTS | arxiv.org/abs/2410.06885 | https://huggingface.co/spaces/mrfakename/E2-F5-TTS | Generation seems pretty good, but not sure if it will be fast enough. Needs a transcript of the reference audio; F5-TTS needs speed set to 0.8 for better generation. |

tikikun commented 2 days ago

Why does the F5-TTS sample work? It seems everything else is pretty bad.

hahuyhoang411 commented 1 day ago

With F5 we can change Ichigo's system prompt a bit and make it sound more natural.

PodsAreAllYouNeed commented 1 day ago

Tested on TTS Arena and added to Drive:

Commercial: ElevenLabs, FishSpeech v1.4, PlayHT2.0, PlayHT3.0mini, XTTSv2

Non-Commercial: GPT-SoVITS (MIT License), MeloTTS (MIT License; multi-lingual, multi-accent), OpenVoicev2 (MIT License), Parler-TTS and Parler-TTS Large (Apache-2.0), StyleTTS2 (MIT License)

Unknown license: VoiceCraftV2

PodsAreAllYouNeed commented 1 day ago

After testing these models, it seems F5-TTS is the only open-source TTS that both pronounces "Ichigo" correctly and reads out the acronym "AI" correctly. The commercial ones have no problem with this, of course. The next question is whether F5-TTS inference is going to be fast enough. Will update after some testing.
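
For the speed question, one rough way to compare candidates is the real-time factor (RTF): synthesis time divided by the duration of the generated audio, where a value below 1 means faster than real time. Below is a minimal sketch of such a timing harness. It does not call F5-TTS or any other model's API; `real_time_factor` and `dummy_synthesize` are hypothetical names, and the 24 kHz sample rate is only an assumption for the placeholder. Any model would need to be wrapped in a function that takes text and returns a sequence of samples.

```python
import time
from typing import Callable, List, Sequence

# Test sentence from the top of this issue.
TEST_SENTENCE = (
    "I'm Ichigo, a local AI created by Homebrew Research. "
    "I'm here to help answer your questions and make your life easier."
)


def real_time_factor(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
    runs: int = 3,
) -> float:
    """Return the best-of-`runs` RTF: synthesis time / audio duration."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        audio = synthesize(text)
        elapsed = time.perf_counter() - start
        duration = len(audio) / sample_rate  # seconds of generated audio
        if duration <= 0:
            raise ValueError("synthesize returned no audio")
        best = min(best, elapsed / duration)
    return best


def dummy_synthesize(text: str) -> List[float]:
    # Placeholder standing in for a real TTS call (hypothetical):
    # returns 0.05 s of silence per character at an assumed 24 kHz.
    return [0.0] * (len(text) * 1200)


if __name__ == "__main__":
    rtf = real_time_factor(dummy_synthesize, TEST_SENTENCE, sample_rate=24000)
    print(f"RTF: {rtf:.4f} (below 1.0 means faster than real time)")
```

Swapping `dummy_synthesize` for a wrapper around each shortlisted model, on the same hardware and the same test sentence, would give comparable numbers. Taking the best of several runs reduces the effect of warm-up, though for streaming use the time to first audio chunk probably matters more than total RTF.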