neonbjb / tortoise-tts

A multi-voice TTS system trained with an emphasis on quality
Apache License 2.0

Feature: Invoke Emotion With Two Audio Sources #10

Open ExponentialML opened 2 years ago

ExponentialML commented 2 years ago

Hello, thanks for the awesome project! This is very fun to mess around with.

One thing I've been having fun with is mixing and mashing voices together. I've noticed that many TTS models lack emotion due to the nature of how they work. That gave me the idea that instead of mixing two voices together to create a new one, we could extract features from one and invoke a type of style transfer, if you will. I was thinking of a framework such as:

A. Source audio: normal speaking.
B. One or two clips of somebody speaking in an angry, sad, happy tone, etc.
C. Source A references source B's utterance, not the words spoken explicitly, just the tone of voice.

If it's possible to do so without training another model, I would definitely look into doing this in my free time if led in the right direction. Cheers!
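
For what it's worth, here is a rough sketch of the A + B conditioning idea under what I believe is the current tortoise.api interface; the file paths are hypothetical, and treating the emotive clip as just another conditioning sample is an assumption on my part, not a confirmed technique:

import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()

# (A) clips of the target speaker talking normally, plus (B) a clip of an
# angry delivery. Paths are hypothetical; conditioning clips are 22050 Hz.
clips = [load_audio(p, 22050) for p in [
    'voices/source_normal_1.wav',
    'voices/source_normal_2.wav',
    'voices/angry_reference.wav',
]]

# Tortoise pools the conditioning it derives from all supplied clips, so
# the emotive clip should nudge the tone of the generated speech.
gen = tts.tts_with_preset('This was not supposed to happen.',
                          voice_samples=clips, preset='fast')
torchaudio.save('styled.wav', gen.squeeze(0).cpu(), 24000)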

neonbjb commented 2 years ago

Absolutely!

I like the idea. This would be easy to do if I had access to a tone classifier. In such a case, I could probably use it as a prior for Tortoise to improve generation.

The other interesting thing you could do along this line of thinking is playing with the conditioning latent space (which currently encodes both voice & audio qualities as well as tone) and trying to build something that can change the "tone" components of the latent without modifying the voice & audio qualities.

If you have experience with ML stuff and are interested in pursuing this, let me know. I'd be glad to help you extract these conditioning latents (and show you how to manually feed them into tortoise as well).
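
For concreteness, a minimal sketch of what manually extracting and blending those conditioning latents could look like, assuming the current tortoise.api interface (TextToSpeech.get_conditioning_latents, tts_with_preset); the two voice folders and the naive linear blend are illustrative assumptions, not a tested recipe:

import torch
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()

# Two readings of the same speaker: 'myvoice_neutral' and 'myvoice_angry'
# are hypothetical folders of reference clips under tortoise/voices/.
neutral_samples, _ = load_voice('myvoice_neutral')
angry_samples, _ = load_voice('myvoice_angry')

# Each call returns the (autoregressive, diffusion) conditioning latent pair.
neutral_latents = tts.get_conditioning_latents(neutral_samples)
angry_latents = tts.get_conditioning_latents(angry_samples)

# Naive linear interpolation between the two latent pairs; alpha=0 is the
# neutral reading, alpha=1 the angry one.
alpha = 0.5
blended = tuple(torch.lerp(n, a, alpha)
                for n, a in zip(neutral_latents, angry_latents))

gen = tts.tts_with_preset("I can't believe you did that.",
                          conditioning_latents=blended, preset='fast')
torchaudio.save('blended.wav', gen.squeeze(0).cpu(), 24000)

Whether a plain lerp between latent pairs cleanly separates tone from voice identity is exactly the open question here; it is just the simplest thing to try.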

ExponentialML commented 2 years ago

Interesting idea.

I can give it a shot depending on how low-level it is. If it's at the Pythonic level then I may be able to pursue it, but if it requires me to know the mechanisms behind how it works (for example, the equations in ML papers on arXiv), then I would have to study quite a bit of CS on the side :).

bmc84 commented 2 years ago

Hi! Firstly, thank you so much for this amazing repo and all your hard work. I quite easily got it all up and running under Windows 10 with my RTX 3070 (tbh I was surprised it worked; I thought it might need more than 8GB).

I didn't want to start a new issue since this is somewhat related to this topic: would it be possible to apply a different voice to an existing audio file? That is, could you take a recorded phrase (instead of starting with a text prompt) and then apply a different speaker's voice to it? I'm assuming that isn't possible with the current functionality, but would it be doable in a later version?

StoneCypher commented 2 years ago

I was actually going to ask for this via a different mechanism.

You say elsewhere that you have a plan for moving between two voices based on their latents.

I had intended to just make a collection of voices by reading some source text in varying emotional cadences.

If you were to just start by giving us the ability to smoothly shift between voices based on a token, then the question of figuring out when to do it could be pushed downstream.

Consider the case of using this to generate video game dialogue, based on the way one character or another moves through a dialogue tree, or repeated variations for a character in a game like Civ, based on the relationship between the two nations (awe, disgust, fear, colloquial, hatred, distrust, etc.). At that point, emotional inference can come from whatever is happening in the game, and need not (indeed, should not) come from the text at all.

I'm not saying it should never come from the text, but I am saying that I think they're separate problems.

If you would help us create text that can, itself, indicate when it's time to shift from voice 1 to voice 2, I think that's a big step in the right direction. Don't need anything fancy like easing; a linear interpolation between located tags would be enough.

No need to figure out what the "right" emotional cues are. They'll vary character to character. Just let us have string labels and we can figure it out.

Here's one quick example of how it could be done. It's maybe a little counter-intuitive, but I think it'd work really well.

Add some tag (here I'm using [voice: ... ] but whatever works) which effectively means "at this tag, start transitioning from the voice I'm in to the voice I'm naming here; the transition finishes at the next tag."

So, notice that I repeat newscasterExcited. That's because the voice is neutral for the first five words; at the first newscasterExcited tag it starts tweening from the neutral it started in to the excited that's being requested, and it then tweens from excited to excited, meaning it's not actually tweening at all, just staying excited. We do the same thing at the end for newscasterVeryAngry. That notation also means you can double-state a voice for an immediate switch, since the tween phase spans a zero-length band. (Alternately, you could have a more complex parser and start adding flags, but it's unnecessary, and I'd advise against it.)

python do_tts.py --text "I'm going to speak this [voice:newscasterExcited] and \
it's going to go great, [voice:newscasterExcited] like super great, \
[voice:newscasterSad] but it will get replaced with better \
[voice:newscasterAngry] and that person will feel my [voice:newscasterVeryAngry] \
ultimate indignant wrath [voice:newscasterVeryAngry] and my vengeance will be \
done!" --voice newscasterNeutral
neonbjb commented 2 years ago

Good idea. I need to experiment with voice shifting, but if it works, I will probably incorporate this control scheme.