huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0
4.23k stars, 418 forks

Voice Consistency Working Pretty Well -- Plus Zero-Shot Cloning! #139

Open apresence opened 1 day ago

apresence commented 1 day ago

I've managed to get a POC with voice consistency working pretty well. Along the way, I've figured out how to do ok-ish zero-shot voice cloning, too. Getting there took drawing on tidbits spread across several issues posted here, the HF repos, the various GitHub sources linked here and there, and about two weeks of experimentation on my part.

Here is an example of zero-shot voice cloning. Between each sentence, I alternate ground truth and Parler TTS audio between the left and right channels. I also lead the ground truth audio with an upwards tone, and Parler with a downwards tone. I did this primarily for my own purposes so I could compare them more closely myself.
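For anyone who wants to build a similar comparison file, the layout described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used for the sample. The function names, the 16 kHz sample rate, the 0.3 s cue length, and the 440/880 Hz tone frequencies are all assumptions chosen for the example.

```python
import math
import struct
import wave

SAMPLE_RATE = 16_000  # assumed; use whatever rate your clips share


def sweep(f_start, f_end, seconds=0.3, rate=SAMPLE_RATE, amp=0.3):
    """Return a linear frequency sweep as a list of floats in [-1, 1]."""
    n = round(seconds * rate)
    out, phase = [], 0.0
    for i in range(n):
        freq = f_start + (f_end - f_start) * i / max(n - 1, 1)
        phase += 2 * math.pi * freq / rate
        out.append(amp * math.sin(phase))
    return out


def build_comparison(pairs, rate=SAMPLE_RATE):
    """Lay out (ground_truth, generated) mono clips as stereo frames.

    Ground truth is preceded by a rising tone, generated audio by a falling
    tone. The channel each source uses flips from one sentence to the next.
    """
    frames = []
    for idx, (truth, generated) in enumerate(pairs):
        truth_on_left = idx % 2 == 0
        segments = [
            (sweep(440, 880, rate=rate) + list(truth), truth_on_left),
            (sweep(880, 440, rate=rate) + list(generated), not truth_on_left),
        ]
        for samples, on_left in segments:
            for s in samples:
                # The silent channel gets 0.0 so only one side plays at a time.
                frames.append((s, 0.0) if on_left else (0.0, s))
    return frames


def write_stereo_wav(path, frames, rate=SAMPLE_RATE):
    """Write (left, right) float frames as 16-bit PCM stereo."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        data = bytearray()
        for left, right in frames:
            for s in (left, right):
                clipped = max(-1.0, min(1.0, s))
                data += struct.pack("<h", int(clipped * 32767))
        wav.writeframes(bytes(data))
```

Each clip is expected as a sequence of mono float samples at the same sample rate; something like `write_stereo_wav("compare.wav", build_comparison(pairs))` would then produce the file.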

The ground truth audio is from a YouTube interview found here.

Only a 5-second snippet of ground truth audio was required to do the clone. Each sentence in the audio sample is a new Parler-TTS generation using text from the audio transcript. As you can hear, the consistency is pretty good. It's even better for voices in the training dataset.

For comparison, here is an example of cloning vs. non-cloning generation. All the settings are the same between the two; the only difference is whether the cloning feature is on or off.

Code, credits and further details forthcoming -- I have to clean up things and get rid of some bugs first for fear that the code-shamers will eat me alive. 😅

apresence commented 22 hours ago

Here is an example with the new voice steering feature on, and one with it off.

It's a simple on/off setting. Other than that, you'd use Parler-TTS just like you normally would. I've also added the ability to save voices you like so you can reuse them later, even between program executions.

Again, each sentence is a separate generation. With steering on, the voice consistency is pretty good. With it off, it varies considerably.

The model, voice description, seed, etc. are all the same between the two examples; only the new steering feature was turned on or off.
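To make that kind of on/off comparison reproducible, a small harness along these lines could help. It is only a sketch of the experimental setup and shows nothing about how steering itself works. `GenSettings`, `run_ab`, and the `synthesize` callback are hypothetical names, with `synthesize` standing in for whatever function actually calls the model. The sentence splitter is deliberately naive.

```python
import re
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class GenSettings:
    """Everything held constant across the two runs (names are hypothetical)."""
    model_name: str
    description: str
    seed: int
    steering: bool = False


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def run_ab(text: str, base: GenSettings, synthesize: Callable) -> Dict[bool, list]:
    """Generate each sentence separately, once with steering off and once on.

    Only the `steering` flag changes between the two passes; the model name,
    description, and seed are copied unchanged from `base`.
    """
    results = {}
    for steering in (False, True):
        settings = replace(base, steering=steering)
        results[steering] = [
            synthesize(sentence, settings) for sentence in split_sentences(text)
        ]
    return results
```

The result maps `False` and `True` to the per-sentence outputs, so the two runs can be laid side by side and compared sentence by sentence.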

apresence commented 17 hours ago

Here's an updated voice clone example. I had used mini before because its output is more consistent. Although it doesn't sound as good, I was able to one-shot it.

Large takes a lot of wrangling to get it to behave, so it took a few passes. It could be that my source audio is not good enough (background hum, mic pops, echoes).

Anyway, this is pretty good for a quick POC!

apresence commented 10 hours ago

I've got Parler-TTS doing zero-shot crying now. Check it out here. 100% of this audio was generated by Parler-TTS, with only some light editing in Audacity.