[Feature request] Add text-to-speech with SpeechT5

josephrocca commented 1 year ago

Name of the feature

Text-to-speech using SpeechT5, which was recently added to Transformers.

Blog post: https://huggingface.co/blog/speecht5
Spaces demo: https://huggingface.co/spaces/Matthijs/speecht5-tts-demo
Models: https://huggingface.co/mechanicalsea/speecht5-tts
GitHub repo: https://github.com/microsoft/SpeechT5/
Paper: https://arxiv.org/abs/2110.07205
Speaker embedding creation: https://huggingface.co/mechanicalsea/speecht5-vc/blob/main/manifest/utils/prep_cmu_arctic_spkemb.py

Reason for request

The browser's default TTS API is quite bad if you want to create an experience that works nicely across all browsers. Firefox's voices in particular are extremely robotic. Some applications require that the voice is consistent, and of a particular style/tone/etc. SpeechT5 allows you to create 512-dim speaker embeddings so you can use an arbitrary voice style.

Additional context

The model runs in realtime on the CPU (PyTorch), so with WebGPU we should easily have realtime generation on the web.
According to the above-linked models repo, the models are 600M (T5) and 300M (HiFi-GAN), but I've just tried running it locally with the new Docker integration on Hugging Face and it downloads a 585M model and a 50M model. So I'm not sure what's going on with the GAN size difference. Maybe they have quantized the GAN, but not T5? Hoping that the T5 model can be quantized, because that would move it from "reasonable" to "good" territory in terms of size. I'm assuming that it's currently in 16-bit format.
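
As a rough sanity check on the size question, here is a minimal sketch of the arithmetic, assuming the 585M download is a plain weights file with negligible metadata. The helper names (`implied_params_millions`, `requantized_size_mb`) are made up for illustration, and which dtype the checkpoint actually uses is not confirmed here, so both the 32-bit and 16-bit cases are shown.

```python
# Bytes per parameter for common weight formats (standard widths).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def implied_params_millions(file_size_mb: float, dtype: str) -> float:
    """Parameter count (in millions) implied by a file size in MB.

    Hypothetical helper: assumes the file is almost entirely weights.
    """
    return file_size_mb / BYTES_PER_PARAM[dtype]


def requantized_size_mb(file_size_mb: float, src_dtype: str, dst_dtype: str) -> float:
    """Approximate file size after converting every weight to dst_dtype."""
    return file_size_mb * BYTES_PER_PARAM[dst_dtype] / BYTES_PER_PARAM[src_dtype]


# The stored dtype of the 585M file is an assumption, so try both candidates.
for src in ("float32", "float16"):
    params = implied_params_millions(585, src)
    as_int8 = requantized_size_mb(585, src, "int8")
    print(f"if {src}: ~{params:.1f}M params, ~{as_int8:.1f} MB as int8")
```

If the file turns out to be 32-bit rather than 16-bit, 8-bit quantization would cut it to about a quarter of its current size instead of half, so the dtype is worth checking before estimating the final download size.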

Example clip from the Spaces demo (this embedding is pretty monotone):

tmptgsysvc8.webm

xenova commented 1 year ago

Woah 🤯 This definitely sounds do-able! I'll look into it (and hopefully add it quite soon 💪 )

xenova commented 1 year ago

I've been looking into it more today, but it seems as though HF does not support text-to-speech in the pipeline function? It also appears that optimum doesn't support text-to-speech as a task (needed to convert to ONNX format).

Fortunately, the spaces demo you sent above includes some code I can use for testing (https://huggingface.co/spaces/Matthijs/speecht5-tts-demo/blob/main/app.py).

If you want, you could even open up a feature request / PR on the main transformers branch to add this.

josephrocca commented 1 year ago

Done: https://github.com/huggingface/transformers/issues/22487

josephrocca commented 1 year ago

Just came across this:

The audio quality here seems quite good for the model size.

kungfooman commented 1 year ago

@josephrocca Do you have any favorite model or do you use different models for different tasks?

I am very much looking forward to this :see_no_evil:

I guess the decision matrix would contain:

Number of available languages
Ability to handle emotive text
Size of the model
Transformers support

Just tested Bark: https://github.com/huggingface/transformers/issues/23036

sudo bark.webm

I like the ability to add emotion, just funny that it suddenly changed the voice/gender too :sweat_smile:

josephrocca commented 1 year ago

Yeah expressiveness/non-roboticness is the main factor for me. And next is inference speed. Size is probably not a big issue for my use cases - anything under 500mb is fine.

huggingface / transformers.js

[Feature request] Add text-to-speech with SpeechT5 #59