huggingface / speech-to-speech

Speech To Speech: an effort for an open-sourced and modular GPT4-o
Apache License 2.0

Suggestion: support GPT-SoVITS as TTS (fast voice cloning, so users can talk to their favorite voice rather than a generic AI voice). #92

Open insufficient-will opened 1 month ago

insufficient-will commented 1 month ago

Congratulations and many thanks first! I think the project has great potential to become a popular foundation. If you deem it appropriate, would you support GPT-SoVITS as well?

I know a lot of TTS engines are already supported, but GPT-SoVITS offers something different: it lets users clone their favorite voice very efficiently.

Talking to AI is inspiring, but hearing the response in a particular voice is what intrigues people, and it could be one of the ultimate goals when people are willing to talk to a machine. GPT-SoVITS can do a decent voice clone from a few clips in a few minutes, which makes it an ideal addition to the existing TTS solutions.

Best wishes!

rs545837 commented 1 month ago

Did you ever take a look at StyleTTS2?

insufficient-will commented 1 month ago

It looks promising. I am in dire need of voice cloning and multilingual support. Here is a supplement to the issue.

Use scenario: I am making AI-voiced audiobooks and RAG. My audience is a group of third-person shooter gacha gamers (Snowbreak). I will clone the characters' voices and use them either to voice a book or to answer questions.

The TTS has to excel at voice cloning. A pre-trained voice won't do, because the audience doesn't want that voice; each listener wants his or her own favorite.

The TTS should also support multilingual scenarios, especially Chinese, English, Italian (the game has a popular character with an Italian background) and, if possible, Hindi (for an AI bot - I don't know why a bot is popular in a gacha game, but it happens).

To expand on this topic a bit: for professional use cases, like medical consulting, a pre-trained voice will do, because the key is not the voice but the accuracy of the content. For everyday use cases, though, emotional engagement comes in. This isn't limited to gacha games.

Limitation: the amount of training data needed for a voice clone, plus the training hardware requirements and time. The less, the better.

Current solution: GPT-SoVITS. It can do a decent clone with 10-50 clips of 3-10 seconds each, in 10 minutes on an RTX 3090. But it is not perfect yet, as explained below.
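To put those numbers in perspective, here is a rough back-of-the-envelope sketch of how much reference audio that range implies. It only multiplies the clip counts and clip lengths quoted above; the function name is made up for illustration and is not part of GPT-SoVITS.

```python
# Clip counts and clip lengths as quoted in the comment above.
MIN_CLIPS, MAX_CLIPS = 10, 50
MIN_CLIP_SECONDS, MAX_CLIP_SECONDS = 3, 10


def total_audio_range_seconds():
    """Return (smallest, largest) total reference audio in seconds."""
    smallest = MIN_CLIPS * MIN_CLIP_SECONDS
    largest = MAX_CLIPS * MAX_CLIP_SECONDS
    return smallest, largest


low, high = total_audio_range_seconds()
print(f"{low} s to {high} s (about {high / 60:.1f} minutes at most)")
```

So the whole dataset is somewhere between half a minute and a bit over eight minutes of audio.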

Current options:
Voice clone quality: every project claims to be the best. I won't judge, but I've tried some of the available methods and they don't come close to my current solution.
CN support: ChatTTS, Melo, and GPT-SoVITS are OK. Parler is not OK.
EN support: of course, all are OK.
Italian and Hindi: of course, none is OK.
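The same language notes can be written down as a small lookup, which makes it easy to see which engines cover a given set of languages. This is only a sketch of what I reported above from my own trials, not an official compatibility list; the language codes (`zh`, `en`, `it`, `hi`) and the helper name are my own labels.

```python
# Language support as reported in this comment (the poster's own trials,
# not an official compatibility list).
REPORTED_SUPPORT = {
    "ChatTTS": {"zh", "en"},
    "Melo": {"zh", "en"},
    "GPT-SoVITS": {"zh", "en"},
    "Parler": {"en"},
}


def engines_supporting(required):
    """Return the engines whose reported languages cover every required one."""
    required = set(required)
    return [name for name, langs in REPORTED_SUPPORT.items() if required <= langs]


print(engines_supporting({"zh", "en"}))  # Chinese + English
print(engines_supporting({"it"}))        # Italian
print(engines_supporting({"hi"}))        # Hindi
```

Asking for Italian or Hindi returns an empty list, which is exactly the gap I am trying to fill.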

It looks like StyleTTS2 could be my savior after all.

andimarafioti commented 1 month ago

Hey, I would be more than ok adding support for this TTS. If you want to do it I think it would be cool, I would review it 👍

We are still discussing a bit where to take this library next, thank you for sharing your ideas!

insufficient-will commented 1 month ago

Right now I am using SillyTavern, Kobold, and GPT-SoVITS to do a kind of speech-to-speech (with the voice I cloned). But it's slow even on a 3090; maybe a 4090 can do better? I have tried this HF speech-to-speech on a Mac, and it is a much better experience. Wherever you are heading, may fortune favor your path.

PaParaZz1 commented 3 weeks ago

Thanks for this awesome project. Based on a similar pipeline, we have released a Chinese speech-to-speech project named CleanS2S, which supports more interesting, streaming interactions.

Here is a snapshot of this project:

[Image: 20241008-173750]

Looking forward to more advice and feedback!