Open spike4379 opened 11 months ago
Unfortunately, this is an issue with the underlying model. Try shortening your input audio to a 5-9 second stretch where the accent is very noticeable; that might help.
I had to get a good-quality, clean sample of someone for the output to sound like them and keep sounding like them. There is a very occasional slip in the audio, but 95%+ sounds good. I also don't think longer clips necessarily give better results (though I've not done much testing on that and kept my samples around the 8-9 second mark).
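If it helps anyone, here is a rough sketch of how a clip could be cut down to roughly that length using only Python's standard `wave` module. The function name `trim_wav`, the file paths, and the default 8-second window are my own placeholders, not anything the TTS project provides, and it only handles uncompressed PCM WAV files.

```python
import wave


def trim_wav(src_path, dst_path, start_s=0.0, duration_s=8.0):
    """Copy up to duration_s seconds of audio, starting at start_s, into a new WAV.

    Hypothetical helper for illustration; returns the length written, in seconds.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Clamp the start so setpos() never points past the end of the file.
        start_frame = min(int(start_s * rate), src.getnframes())
        src.setpos(start_frame)
        frames = src.readframes(int(duration_s * rate))

    with wave.open(dst_path, "wb") as dst:
        # Same channels, sample width and rate as the source; the frame
        # count in the header is corrected automatically on close.
        dst.setparams(params)
        dst.writeframes(frames)

    bytes_per_frame = params.sampwidth * params.nchannels
    return len(frames) / (bytes_per_frame * rate)
```

Something like `trim_wav("sample.wav", "sample_8s.wav", start_s=2.0)` would then keep seconds 2 through 10 of the recording. Picking the start point by ear, where the accent is clearest, is still up to you.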
Other things that may help:
I've not tested this yet, but I also wonder whether a clip that was itself AI-generated will come out sounding as right as genuine audio of a real person. There could be a law of diminishing returns at work, degrading the quality.
My current experience is that the better the sample, the more the output sounds like the original person: their accent, nuances, etc.
EDIT - Changed the suggested Hz
The model samples at 24 kHz mono, so that's probably what you want your source audio to be.
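For anyone who wants to try matching that format, below is a minimal sketch that converts a 16-bit PCM WAV to 24 kHz mono with only the Python standard library. The name `to_24k_mono` and the paths are hypothetical, the 24 kHz figure is taken from the comment above rather than verified by me, and the resampling is plain linear interpolation with no low-pass filter, so a dedicated audio tool will likely give cleaner results when downsampling.

```python
import array
import sys
import wave

TARGET_RATE = 24_000  # assumption: taken from the comment above


def to_24k_mono(src_path, dst_path, target_rate=TARGET_RATE):
    """Downmix a 16-bit PCM WAV to mono and resample it by linear interpolation."""
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV")
        channels = src.getnchannels()
        src_rate = src.getframerate()
        samples = array.array("h", src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Downmix: average the interleaved channels of each frame.
    mono = [
        sum(samples[i:i + channels]) / channels
        for i in range(0, len(samples) - channels + 1, channels)
    ]

    # Resample: for each output sample, blend the two nearest input samples.
    n_out = int(len(mono) * target_rate / src_rate)
    out = array.array("h")
    last = len(mono) - 1
    for j in range(n_out):
        pos = j * src_rate / target_rate
        i = min(int(pos), last)
        frac = pos - i
        nxt = mono[min(i + 1, last)]
        out.append(int(round(mono[i] * (1 - frac) + nxt * frac)))

    if sys.byteorder == "big":
        out.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(target_rate)
        dst.writeframes(out.tobytes())
    return n_out
```

Calling `to_24k_mono("sample.wav", "sample_24k_mono.wav")` writes the converted file and returns the number of samples written. MP3 input would need decoding to WAV first, which this sketch does not attempt.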
Love the TTS, this is amazing. However, I thought I would bring up that no matter which clip I use or whether it is WAV or MP3, and even when the clip is perfect, the generated speech always drifts between an American accent and a British one. Is there a known way to label the audio sample as American or British so the model knows which one it should stick to?
If this topic needs to go elsewhere please let me know.