kanttouchthis / text_generation_webui_xtts

XTTSv2 Extension for oobabooga text-generation-webui

Generated audio swapping accents over time #7

Open spike4379 opened 1 year ago

spike4379 commented 1 year ago

Love the TTS, this is amazing. However, I thought I would bring up that no matter which clip I use, whether WAV or MP3, and even when the clip itself is perfect, the generated speech always drifts between an American and a British accent. Is there a known way to label the audio sample as American or British so the model knows which accent to stick to?

If this topic needs to go elsewhere please let me know.

kanttouchthis commented 1 year ago

Unfortunately, this is an issue with the underlying model. Try shortening your input audio to 5-9 seconds where the accent is very noticeable; that might help.
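For anyone who wants to script the trimming suggested above, here is a minimal sketch using only Python's standard `wave` module. It assumes the source clip is an uncompressed PCM WAV file; the function name `trim_wav` and the file names in the usage note are hypothetical, and this is not part of the extension itself.

```python
import wave


def trim_wav(src_path, dst_path, start_s, duration_s):
    """Copy duration_s seconds of src_path, starting at start_s, into dst_path.

    Hypothetical helper: assumes an uncompressed PCM WAV input.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Clamp the start so a late start_s yields an empty clip, not an error.
        start_frame = min(int(start_s * rate), src.getnframes())
        src.setpos(start_frame)
        # readframes returns fewer frames if the file ends early.
        frames = src.readframes(int(duration_s * rate))

    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width, and rate from the source;
        # the frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
```

Calling `trim_wav("voice.wav", "voice_short.wav", 2.0, 7.0)` would keep seconds 2 through 9 of the clip. Picking the window where the accent is strongest still has to be done by ear.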

erew123 commented 1 year ago

I had to get a pretty good quality, clean sample of someone for it to sound, and keep sounding, like them. There is a very occasional slip in the audio, but 95%+ sounds good. I also don't think longer clips necessarily give better results (though I've not done much testing on that and have kept my samples around the 8-9 second mark).

Other things that may help:

I've not tested this yet, but I also wonder whether a clip that was itself AI-generated would come out sounding as good as genuine audio of a real person. There could be diminishing returns, with quality degrading at each generation.

My current experience is that the better the sample, the more the output sounds like the original person, including their accent, nuances, etc.

EDIT - Changed the suggested Hz

kanttouchthis commented 1 year ago

The model samples at 24 kHz mono, so that's probably what you want your source audio to be.
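As a rough illustration of what converting a clip to 24 kHz mono involves, here is a small pure-Python sketch. The helpers `to_mono` and `resample_linear` are hypothetical names, they work on plain lists of sample values, and linear interpolation is used only for simplicity: it applies no anti-aliasing filter, so a dedicated audio tool will likely give cleaner results when downsampling.

```python
def to_mono(frames, channels):
    """Average interleaved samples (L, R, L, R, ...) into one channel."""
    return [
        sum(frames[i:i + channels]) / channels
        for i in range(0, len(frames) - channels + 1, channels)
    ]


def resample_linear(samples, src_rate, dst_rate=24000):
    """Resample a mono signal to dst_rate by linear interpolation.

    Simplified sketch: no low-pass filtering is applied before downsampling.
    """
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

For example, a 48 kHz stereo clip would first go through `to_mono(frames, 2)` and then `resample_linear(mono, 48000)`, which halves the number of samples.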