American trained model produces English accent.

neonbjb / tortoise-tts

A multi-voice TTS system trained with an emphasis on quality

Apache License 2.0

American trained model produces English accent. #460

Open mmehrle opened 1 year ago

mmehrle commented 1 year ago

I've used 'clone-your-own-voice-tortoise-tts.ipynb' and followed more recent instructions to generate a full script via Google Colab. Works pretty well actually.

EXCEPT one issue: I've trained a model on a voice with a clear American accent (five to ten voice samples of 6-10 seconds each). The output the engine produces, however, has a clear and very noticeable English accent. It's not a huge problem, but sometimes it alternates between an American and an English accent, which is strange.

Any ideas what could be causing this?

ZeDespo commented 1 year ago

A little late to the party, but I can help with this. It's possible that your training dataset is not as varied as you think it is. Since you only have a handful of training clips, I'd suggest introducing the training files one at a time until you find the file that gives your TTS voice a British accent.
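
The one-at-a-time search could be sketched roughly as below. This is only an illustration: `find_accent_culprit` and the `sounds_british` callback are hypothetical names, and the callback stands in for "render with these clips and listen", which tortoise does not do for you. Because renders vary from run to run, it is probably worth fixing the seed or listening to a few renders per step before blaming a clip.

```python
from typing import Callable, List, Optional, Sequence


def find_accent_culprit(
    clips: Sequence[str],
    sounds_british: Callable[[List[str]], bool],
) -> Optional[str]:
    """Add clips one at a time and return the first clip whose addition trips the check.

    `sounds_british` is a hypothetical callback: it receives the clips used so far
    and returns True if the rendered voice has the unwanted accent.
    """
    active: List[str] = []
    for clip in clips:
        active.append(clip)
        if sounds_british(list(active)):
            return clip
    # No clip triggered the check.
    return None


# Usage with a stand-in judge that "hears" the accent whenever clip_3.wav is present.
clips = ["clip_1.wav", "clip_2.wav", "clip_3.wav", "clip_4.wav"]
culprit = find_accent_culprit(clips, lambda used: "clip_3.wav" in used)
print(culprit)
```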

Also, make sure that you have:

Removed large swathes of leading / trailing silence.
Normalized your training data so it's not too loud.
Made all bits of training data the same length.
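
A minimal sketch of those three clean-up steps, written against plain lists of float samples in the range -1.0 to 1.0 so it runs without any audio library. The function names, the 0.01 silence threshold, the 0.9 target peak and the 8-second length are all assumptions picked for illustration, not values taken from tortoise.

```python
from typing import List, Sequence


def trim_silence(samples: Sequence[float], threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing samples whose magnitude is below `threshold`."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return list(samples[loud[0]: loud[-1] + 1])


def peak_normalize(samples: Sequence[float], target_peak: float = 0.9) -> List[float]:
    """Scale the clip so its loudest sample has magnitude `target_peak`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def crop_to_length(samples: Sequence[float], sample_rate: int, seconds: float) -> List[float]:
    """Keep at most `seconds` of audio from the start of the clip."""
    return list(samples[: int(sample_rate * seconds)])


def prepare_clip(samples: Sequence[float], sample_rate: int, seconds: float = 8.0) -> List[float]:
    """Trim silence, normalize loudness, then crop to a common length."""
    trimmed = trim_silence(samples)
    normalized = peak_normalize(trimmed)
    return crop_to_length(normalized, sample_rate, seconds)
```

Running every clip through `prepare_clip` with the same `seconds` value gives clips of equal length, provided each one has at least that much non-silent audio to begin with.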

If all these don't work, try running your do_tts.py script with different presets to see if it works:

python tortoise/do_tts.py --text  "Big Chungus." --voice geralt --seed 420 --preset ultra_fast
python tortoise/do_tts.py --text  "Big Chungus." --voice glados --seed 420 --preset standard  # this will take a while
python tortoise/do_tts.py --text  "Big Chungus." --voice glados --seed 420 --preset high_quality  # ditto

Best of luck!

mmehrle commented 1 year ago

Thanks for this. I'm actually running it from a notebook. Do you know how to set the seed and the preset in code?
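
For reference, here is a rough sketch of what the notebook equivalent might look like. It assumes, as far as I can tell from reading `do_tts.py`, that `TextToSpeech.tts_with_preset` takes a `preset` argument and forwards `use_deterministic_seed` to the underlying `tts` call; check this against the version you have installed. `preset_kwargs` and `render` are hypothetical helper names, and `"fast"` is a preset name assumed from the README rather than from the commands above.

```python
from typing import Any, Dict, Optional

# Three of these appear in the CLI commands above; "fast" is assumed from the README.
KNOWN_PRESETS = ("ultra_fast", "fast", "standard", "high_quality")


def preset_kwargs(preset: str = "fast", seed: Optional[int] = None) -> Dict[str, Any]:
    """Build the keyword arguments that select a preset and, optionally, a fixed seed."""
    if preset not in KNOWN_PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {KNOWN_PRESETS}")
    kwargs: Dict[str, Any] = {"preset": preset}
    if seed is not None:
        # Assumed parameter name, mirroring what do_tts.py passes for --seed.
        kwargs["use_deterministic_seed"] = seed
    return kwargs


def render(text: str, voice: str, preset: str = "fast", seed: Optional[int] = None):
    """Generate speech for `text` using a voice folder known to tortoise."""
    # Imported lazily so preset_kwargs can be used without tortoise installed.
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    voice_samples, conditioning_latents = load_voice(voice)
    tts = TextToSpeech()
    return tts.tts_with_preset(
        text,
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        **preset_kwargs(preset, seed),
    )
```

In a notebook cell this would be called as something like `audio = render("Big Chungus.", "geralt", preset="standard", seed=420)`. Passing the same seed each time should make repeated renders comparable, which helps when judging whether a change to the samples actually affected the accent.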

Also, I've noticed that there is quite some randomness to the process. I can run the exact same script several times and it produces varying output. Some of it is actually amazingly good, and what I've learned is that the input is hugely important. For example, we had a voice actor read paragraphs in different moods, one more excited/engaging and the other in a more somber/serious tone. The results reflected that, but again every render produced different results, which is fine I guess.

Another thing I noticed is that input quality matters hugely (trash in -> trash out). I took a VSL with a voice I liked and used AI to remove the background music. Then I cleaned it up a little and extracted voice samples. That's actually the one that produced the English accent for some reason. It hasn't happened with other American voices I've trained.

Anyway, I'll try cycling out some of the training data and see what happens. Thanks for the pointers.

entailz commented 1 year ago

Thanks for this! After some post-processing and preset fiddling, it's sounding good.