neonbjb / tortoise-tts

A multi-voice TTS system trained with an emphasis on quality
Apache License 2.0
13.07k stars 1.8k forks

Trained voices do not sound anything like sample clips #407

Open Horang1173 opened 1 year ago

Horang1173 commented 1 year ago

Hello! If I could get some help, that would be greatly appreciated. I've trained several voices now using pretty clean audio clips from anime characters, etc., for personal/non-social use. The problem is that the generated voices sound like completely different people. Audio clips of a high school anime character come out sounding like a cowgirl selling gems on late-night TV. Is there any way to get the output closer to the source voice?

Side note on ethics: I use this 100% for personal projects and do not have any plans to branch out. This is a hobbyist passion for me, to get better at generating AI content and potentially use it to generate locally controlled characters for niche games with friends. I know it won't be perfect, but when a school-age man sounds like a college-age cowgirl... it's a little weird. Any help would be appreciated!

Piercarlomaia commented 1 year ago

It seems to work better on female voices, with 5-second audio clips recorded in English.

Yanall-Boutros commented 1 year ago

What's the total length of audio you're sending it? I have had some pretty good results feeding it a few hundred 10-second clips I've chopped up from speeches online.
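
For anyone who wants to try the chopping step, here is a minimal sketch of cutting one long recording into fixed-length clips. It is not part of tortoise-tts: `split_wav` and its parameters are hypothetical names made up for this example, it uses only Python's standard `wave` module, and it assumes the source is an uncompressed PCM WAV file. It cuts at fixed intervals, so a clip can start or end mid-word; trimming to sentence boundaries would need extra work.

```python
import wave
from pathlib import Path


def split_wav(src, out_dir, clip_seconds=10.0, keep_short_tail=False):
    """Split a PCM WAV file into consecutive clips of clip_seconds each.

    Hypothetical helper for illustration; returns the written clip paths in order.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        frames_per_clip = int(params.framerate * clip_seconds)
        if frames_per_clip <= 0:
            raise ValueError("clip_seconds is too small for this sample rate")
        bytes_per_frame = params.sampwidth * params.nchannels
        index = 0
        while True:
            frames = reader.readframes(frames_per_clip)
            n_read = len(frames) // bytes_per_frame
            if n_read == 0:
                break  # end of file
            if n_read < frames_per_clip and not keep_short_tail:
                break  # drop the final clip if it is shorter than requested
            clip_path = out_dir / f"{src.stem}_{index:04d}.wav"
            with wave.open(str(clip_path), "wb") as writer:
                # Keep the source's channel count, sample width, and sample rate.
                writer.setnchannels(params.nchannels)
                writer.setsampwidth(params.sampwidth)
                writer.setframerate(params.framerate)
                writer.writeframes(frames)
            written.append(clip_path)
            index += 1
    return written
```

Calling something like `split_wav("speech.wav", "clips/my_voice")` (both paths are placeholders) would write `speech_0000.wav`, `speech_0001.wav`, and so on. The sample rate and format that tortoise-tts expects for voice clips are not covered here, so check the project README before feeding the output in.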