neonbjb / tortoise-tts

A multi-voice TTS system trained with an emphasis on quality
Apache License 2.0

Zero Shot Intonation #16

Closed honestabelink closed 2 years ago

honestabelink commented 2 years ago

As we have all seen with the latest papers, how you prompt transformer models can greatly influence their outputs. See the typical DALL-E/Disco Diffusion prompts or the PaLM paper's section on Chain-of-Thought Prompting.

Prompt engineering is model-specific, shaped by the training set.

As an example, prompts like the following do not produce intonation aligned with the text; instead, the double quotes cause a shift away from the reader's voice, as readers in the training set likely do when voicing quoted text as the character.

She said with a happy voice, "I start my new job today".
They said happily with a happy voice, "I start my new job today".

Matching is not necessarily expected here, so I did some further testing, generating samples with different prompts to see whether the model exhibits this behavior.

Here are my results.

output.zip

Here are my anecdotal findings.

Typical sampling is a must if you care about expressiveness, though there is a noticeable quality drop.

PROMPT A: She said with a sad voice, "I start my new job today".
PROMPT B: It is so sad, I start my new job today.
PROMPT C1: Sad, I start my new job today.
PROMPT C2: Happy, I start my new job today.

Using A, there is a perceptible change in prosody in almost every sample over the "I start my new job today" section. This is expected, assuming aspects of the training set where readers switch to voicing character quotes.

Quotes should be avoided unless you are going for an "audio book reader effect".

Surprisingly though, A never actually produces a "sad"-sounding sentence. This could be for many reasons; I'll leave off speculating for now.

Both B and C give nice intonation aligned with the prompt, with B winning out but requiring more setup. C seems sufficient and simple enough that you could apply it automatically.
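For anyone who wants to reproduce this kind of comparison, here is a minimal sketch using the high-level API from the repo README (`TextToSpeech`, `load_voice`, `tts_with_preset`). The voice and preset are arbitrary choices, not the exact settings used for the samples above.

```python
# Sketch: render the same sentence under prompts A/B/C1/C2 for comparison.
# Uses the README-style API; "tom" and "standard" are just example choices.
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

PROMPTS = {
    "A":  'She said with a sad voice, "I start my new job today".',
    "B":  "It is so sad, I start my new job today.",
    "C1": "Sad, I start my new job today.",
    "C2": "Happy, I start my new job today.",
}

tts = TextToSpeech()
voice_samples, conditioning_latents = load_voice("tom")  # any bundled voice works

for name, text in PROMPTS.items():
    gen = tts.tts_with_preset(
        text,
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset="standard",
    )
    # gen is a (1, 1, N) tensor of 24 kHz audio
    torchaudio.save(f"prompt_{name}.wav", gen.squeeze(0).cpu(), 24000)
```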

Further thoughts, just writing things.

This isn't my area, but I'm interested in tinkering around.

neonbjb commented 2 years ago

This is an interesting finding. One thing I really want to build on top of Tortoise is a word-alignment mechanism which tells you where in the output each word is spoken. If I built that, you could use this idea to do some really interesting prompt engineering. For example, use your prompt B to evoke emotion in the spoken text "I start my new job today", then extract only the part where the model speaks that phrase.

This word-alignment is a pipe dream for now, though. I had hoped that analyzing the attention weights of the autoregressive transformer would be an easy win for this, but from my experiments that does not seem to be the case. :/ I'll keep tinkering, though..

I can't answer all of your questions, but I can offer some feedback:

ExponentialML commented 2 years ago


Referencing this since it ties in so well with what I've asked. Tried it and it works pretty well! There could be some good ideas and conversation between the two.

honestabelink commented 2 years ago

@ftaker887 Great. Yeah, not every sample is aligned as you would expect, but this is a very expressive model, which shows the training set covers the emotional range rather well.

Though I don't know where this information comes from: is it the autoregressive model carrying the expression token by token, or the diffusion model settling somewhere nice amid all the ambiguity?

The diffusion model clearly doesn't just copy the prosody from the reference clips it's provided, which makes me think it relies more on the autoregressive model's latents. In this view, the autoregressive model captures the log-likelihood over how people speak, which is what I guess we'd expect.

I'm going to test misaligning the diffusion model and the autoregressive model, and see what I find.

@neonbjb Thank you for your thoughts. There is a strong qualitative difference when the sentences get complex, using CLVP and CVVP vs. not; glad to see them getting more love.

The word alignment is a really cool idea; it crossed my mind too while testing. "If only there were a way to remove the tokens used for conditioning before generation." But yeah, my experience has been that attention heads are hairier than we'd like.

Very good point on the vocoder too. It's impressive how expressive it is given your description of the vocoder as "nearly lossless". So I take it the vocoder was frozen then; very neat. Diffusion models are such good generators.

neonbjb commented 2 years ago

The UnivNet vocoder is actually a GAN. I am also extremely surprised how expressive it is given how small its training set was (LibriTTS and HiFi-TTS, IIRC). It renders even voice samples far from the training dataset (like strong accents or unique vocal tics). I did indeed keep it frozen, so you are essentially using an exact copy of the UnivNet that mindslab distributes on their GitHub.

The meaning I take from that is that the conversion from MEL<->Audio is very low entropy such that a GAN can learn the entire latent space of the conversion between formats.

That being said, you are not wrong that "diffusion models are expressive". I couldn't agree more. I think they are incredibly strong and are going to revolutionize pretty much the entire generative model space (if you don't consider that to have already happened...).

honestabelink commented 2 years ago

Sorry, I should have been clearer: by "diffusion models are such good generators" I meant the MEL spectrograms from your diffusion decoder.

"The meaning I take from that is that the conversion from MEL<->Audio is very low entropy such that a GAN can learn the entire latent space of the conversion between formats."

This is such an important point, especially once you point out the vocoder's training set. It also answers my question of where the information comes from: as I see it now, the autoregressive model truly contains the information.

And I guess this fits in with your decision to simplify the diffusion decoder.

What I was really getting at, though, was how much better your MEL spectrograms are specifically because a diffusion model is used to generate them.

Fitting into that story of "revolutionize pretty much the entire generative model space" very nicely.

Thanks for the insights 😄

Path-A commented 2 years ago

This is similar to what I've found! I've had success using expletives and using "I" to force anger (although, it often takes multiple attempts). As silly as this seems, something like "Holy !@#$! I am so pi@#ed! INSERT YOUR ANGRY TEXT HERE" will sometimes get you an angry tone for the text you actually care about.

neonbjb commented 2 years ago

OK so I've implemented this, I'd love some feedback. No documentation yet, but here is how it works:

When you feed text to Tortoise, you can add prompt-engineering text in brackets, as suggested above. As an example: "[It is so sad,] I start my new job today." (should) produce a clip that sounds like "I start my new job today".
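For reference, a minimal usage sketch assuming the same README-style API as above; the voice and preset here are arbitrary example choices.

```python
# Sketch: pass the bracketed prompt text straight to the high-level API.
# The bracketed span conditions the delivery but should be redacted from the output.
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()
voice_samples, conditioning_latents = load_voice("tom")  # any bundled voice works
gen = tts.tts_with_preset(
    "[It is so sad,] I start my new job today.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="standard",
)
torchaudio.save("new_job_sad.wav", gen.squeeze(0).cpu(), 24000)  # 24 kHz output
```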

How this works (and why there might be bugs I haven't discovered yet): It uses an alignment mechanism I wrote backed by wav2vec2. There is at least one potential failure mode: if the wav2vec2 horribly mis-transcribes the text, the alignment mechanism won't work well and you will get poor redaction.
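For illustration only (this is not the repo's alignment code), here is a rough sketch of the general idea: greedy CTC decoding from a wav2vec2 checkpoint gives a coarse time for each emitted character, and counting off the characters of the bracketed prompt text gives an approximate cut point for redaction. The checkpoint name, the 16 kHz input assumption, and the greedy argmax alignment are all assumptions; a proper aligner would run forced alignment over the CTC lattice instead.

```python
# Illustration of coarse wav2vec2-based redaction, not the repo's implementation.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL = "facebook/wav2vec2-base-960h"  # assumed checkpoint choice
processor = Wav2Vec2Processor.from_pretrained(MODEL)
model = Wav2Vec2ForCTC.from_pretrained(MODEL).eval()

def coarse_char_times(wav, sr=16000):
    """wav: 1-D float numpy waveform at 16 kHz. Returns [(char, time_sec), ...]."""
    inputs = processor(wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits[0]   # (frames, vocab)
    frame_sec = len(wav) / sr / logits.shape[0]         # seconds per CTC frame
    ids = logits.argmax(dim=-1).tolist()
    blank = processor.tokenizer.pad_token_id            # CTC blank token
    out, prev = [], blank
    for i, tid in enumerate(ids):
        if tid != prev and tid != blank:                # collapse repeats, drop blanks
            out.append((processor.tokenizer.convert_ids_to_tokens(tid), i * frame_sec))
        prev = tid
    return out

def redact_prompt(wav, prompt_text, sr=16000):
    """Crude redaction: drop audio up to roughly where prompt_text ends."""
    letters = [(c, t) for c, t in coarse_char_times(wav, sr) if c.isalpha()]
    n = sum(ch.isalpha() for ch in prompt_text)
    if len(letters) <= n:
        return wav                                      # transcription too short; bail out
    cut = int(letters[n][1] * sr)
    return wav[cut:]
```

This only works as well as the wav2vec2 transcription does, which matches the failure mode described above: if the transcription is badly off, the letter count lands in the wrong place and the cut point is wrong.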

Closing this for now. If you find any bugs, please open a new issue.