neonbjb / tortoise-tts

A multi-voice TTS system trained with an emphasis on quality
Apache License 2.0

sentence repetition #237

Open mb-alex-b-c opened 1 year ago

mb-alex-b-c commented 1 year ago

I have just started playing with Tortoise TTS.

I have been mainly using the following Hugging Face Space to do some initial testing and experimentation: https://huggingface.co/spaces/mdnestor/tortoise

The (zero-shot) voice cloning capability is better than I expected. Yet, when the input text has more than 3 or 4 sentences, it tends to have two issues: (1) part of a sentence gets repeated in the generated audio; (2) certain parts of a sentence turn into random "noise" ("noise" loosely speaking, not in the Diffusion-model sense).

Do other people face a similar issue? Or is it more related to that Hugging Face Space environment?

mb-alex-b-c commented 1 year ago

Here is an example:

[image attachment]

The input text was: "Come and sit down at the dinner table. Come closer and smell the roasted chicken. I want you to get a good sniff. I want you to feel how delicious the chicken smell."

The last sentence "I want you to feel how delicious the chicken smell" was repeated twice.

Here is the link to the generated wav file.

neonbjb commented 1 year ago

This is just something the model does, likely because it wasn't trained on many instances of long speech (because I trained on consumer GPUs with low memory). There's not much you can do other than generate more examples and prune out the bad ones or shorten your inputs.
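Since shortening the inputs is the suggested workaround, here is a minimal sketch of what that might look like: split the text into sentences with a naive regex and regroup them into chunks of at most a few sentences each, so every call to the model stays short. The splitter and the chunk size are my own assumptions, not part of Tortoise itself.

```python
import re

def split_into_chunks(text, max_sentences=3):
    """Split text into chunks of at most max_sentences sentences,
    so each TTS call stays within the length the model handles well.

    Naive splitter: breaks after '.', '!', or '?' followed by whitespace,
    so abbreviations like "A. I." will be split apart too.
    """
    sentences = [s.strip()
                 for s in re.split(r'(?<=[.!?])\s+', text.strip())
                 if s.strip()]
    return [' '.join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]
```

Each chunk could then be fed to the model separately and the resulting clips concatenated; the 4-sentence example above would become two chunks of 3 and 1 sentences.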

mb-alex-b-c commented 1 year ago

@neonbjb ... thanks for the answer. I guess I would opt for the approach of "shortening the inputs."

One more follow-up question to confirm: sometimes there are random noises in between sentences. It seems to me this is correlated with certain punctuation marks, e.g. "?" or the periods in "A. I."

Is it also related to the lack of longer sentences in the training dataset? Thanks again.

Congratulations on that potential 10b investment from MSFT to OpenAI. :)

neonbjb commented 1 year ago

A trained model is simply a mirror of its dataset, so the answer is definitely "yes, it's because of the dataset". With that said, I don't know the specific reason behind this one.

xenotropic commented 1 year ago

Hi James (or anyone else): is there any chance the length_penalty or repetition_penalty params to tts() would have any effect on these? If so, do you have any guesses as to what ranges to try? I can see the defaults in the method definition, but I'd be kind of stabbing in the dark guessing myself.

neonbjb commented 1 year ago

It's possible. I can only recommend you try them out and see. For values, I'd start by doubling them repeatedly until the system breaks, then backing down. It'd be best if you have a few test phrases that reliably reproduce the issue while trying this.

I did take this behavior into account when originally tuning these values, but (as you'll notice) dialing them up affects the quality of speech, so picking the right one is a balancing act.
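The doubling-then-backing-down procedure described above can be sketched as a simple sweep over candidate values. The default values shown here (1.0 for length_penalty, 2.0 for repetition_penalty) are what I recall from the method definition and should be double-checked against your copy of the code; the sweep helper itself is just an illustration.

```python
def doubling_sweep(default, max_doublings=4):
    """Candidate values for a penalty parameter: the default,
    then repeated doublings. Stop once speech quality breaks,
    then back down toward the last value that still sounded good."""
    return [default * (2 ** k) for k in range(max_doublings + 1)]

# Hypothetical usage against a test phrase that reliably reproduces
# the repetition (assumes a TextToSpeech instance named `tts`):
#
# for rp in doubling_sweep(2.0):   # repetition_penalty default ~2.0
#     gen = tts.tts(text, voice_samples=samples, repetition_penalty=rp)
#     # listen / save gen and note where quality degrades
```

As noted above, raising these penalties trades off against speech quality, so the sweep is only a way to bracket the usable range, not to find a single "correct" value.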

AgainPsychoX commented 11 months ago

I have the exact same issue :/