huggingface / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0
11.86k stars, 747 forks

[Feature request] Add text-to-speech with SpeechT5 #59

Closed: josephrocca closed this issue 1 year ago

josephrocca commented 1 year ago

Name of the feature: Text-to-speech using SpeechT5, which was recently added to Transformers.

Reason for request: The browser's default TTS API is quite bad if you want to create an experience that works nicely across all browsers. Firefox's voices in particular are extremely robotic. Some applications require that the voice is consistent, and of a particular style/tone/etc. SpeechT5 allows you to create 512-dim speaker embeddings so you can use an arbitrary voice style.

Additional context

Example clip from the Spaces demo (this embedding is pretty monotone):

tmptgsysvc8.webm
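
To make the "512-dim speaker embeddings" point in the request above a bit more concrete, here is a minimal TypeScript sketch of how a browser app might handle such vectors before passing them to a TTS model. Everything in it is an assumption for illustration: `blendSpeakerEmbeddings` and `l2Normalize` are hypothetical helpers, not part of transformers.js or SpeechT5, and the idea that linearly mixing two embeddings yields a usable in-between voice is not confirmed anywhere in this thread.

```typescript
// Dimensionality of SpeechT5 speaker embeddings, as stated in the request.
const EMBEDDING_DIM = 512;

// Scale a vector to unit L2 length.
// Assumption: the model works best with length-normalized speaker vectors.
function l2Normalize(v: Float32Array): Float32Array {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) {
    throw new Error("cannot normalize a zero vector");
  }
  return v.map((x) => x / norm);
}

// Hypothetical helper: linearly interpolate between two voices, then
// normalize the result. weight = 0 yields (normalized) voice A,
// weight = 1 yields (normalized) voice B.
function blendSpeakerEmbeddings(
  a: Float32Array,
  b: Float32Array,
  weight: number
): Float32Array {
  if (a.length !== EMBEDDING_DIM || b.length !== EMBEDDING_DIM) {
    throw new Error(`speaker embeddings must have ${EMBEDDING_DIM} values`);
  }
  if (weight < 0 || weight > 1) {
    throw new Error("weight must be between 0 and 1");
  }
  const mixed = a.map((x, i) => (1 - weight) * x + weight * b[i]);
  return l2Normalize(mixed);
}
```

Validating the length up front is a deliberate choice here: a wrongly sized embedding would otherwise only surface as a shape error deep inside model inference.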

xenova commented 1 year ago

Woah 🤯 This definitely sounds do-able! I'll look into it (and hopefully add it quite soon 💪 )

xenova commented 1 year ago

I've been looking into it more today, but it seems as though HF does not support text-to-speech in the pipeline function? It also appears that optimum doesn't support text-to-speech as a task (needed to convert to ONNX format).

Fortunately, the spaces demo you sent above includes some code I can use for testing (https://huggingface.co/spaces/Matthijs/speecht5-tts-demo/blob/main/app.py).

If you want, you could even open a feature request / PR on the main transformers repo to add this.

josephrocca commented 1 year ago

Done: https://github.com/huggingface/transformers/issues/22487

josephrocca commented 1 year ago

Just came across this:

The audio quality here seems quite good for the model size.

kungfooman commented 1 year ago

@josephrocca Do you have any favorite model or do you use different models for different tasks?

I am very much looking forward to this 🙈

I guess the decision matrix would contain:

Just tested Bark: https://github.com/huggingface/transformers/issues/23036

sudo bark.webm

I like the ability to add emotion, just funny that it suddenly changed the voice/gender too 😅

josephrocca commented 1 year ago

Yeah, expressiveness/non-roboticness is the main factor for me, and next is inference speed. Size is probably not a big issue for my use cases; anything under 500 MB is fine.