[New model] 🐸TTS advanced Text-to-Speech

jozefchutka commented 1 year ago

Model description

🐸TTS is a library for advanced Text-to-Speech generation. Built on the latest research, it was designed to achieve the best trade-off among ease of training, speed, and quality. 🐸TTS comes with pretrained models and tools for measuring dataset quality, and is already used in 20+ languages for products and research projects.

Open source status

[X] The model implementation is available
[X] The model weights are available

Provide useful links for the implementation

GitHub repo: https://github.com/coqui-ai/TTS
Samples: http://erogol.com/ddc-samples/

susnato commented 1 year ago

Hi @jozefchutka, I would like to work on this issue. I see multiple models under Implemented Models on your link. Do you have a recommendation for which one to start with?

jozefchutka commented 1 year ago

Hi @susnato, thanks for looking into this. I hope to eventually run TTS in the browser via transformers.js, so my recommendation would be to pick a model that is suitable in terms of performance and size.

susnato commented 1 year ago

Hi @jozefchutka, thanks for replying. I was thinking about Speedy-Speech, but I didn't see that model inside TTS/tts/models in the dev branch. Am I looking in the wrong branch?

jozefchutka commented 1 year ago

I have no idea, honestly. But I have just discovered that GitHub provides a very nice code browsing view, including search.

If it's nowhere to be found, it would be worth reaching out to the 🐸TTS team.

amyeroberts commented 1 year ago

cc @sanchit-gandhi

sanchit-gandhi commented 1 year ago

Hey @jozefchutka and @susnato - Coqui were previously focused on providing strong open-source TTS checkpoints, however in the last year they pivoted to more end-user services (see https://twitter.com/coqui_ai/status/1638573847296499712). They haven't been open-sourcing these latest models, and as a result their open-source checkpoints have fallen by the wayside a bit compared to the latest TTS research (e.g. VALL-E, Bark, MQTTS). I would say that a year ago it would have been a very exciting addition, but now there are more performant checkpoints that are growing in popularity amongst the open-source community. I would recommend checking out the aforementioned models if you're interested in a TTS model integration! Also see related https://github.com/huggingface/transformers/issues/22487#issuecomment-1496340245

susnato commented 1 year ago

Hi @sanchit-gandhi thanks for replying! Actually I was going through the same issue and saw your comment -

Indeed, a TTS pipeline would be super helpful to run SpeechT5. We're currently planning on waiting till we have 1-2 more TTS models in the library before pushing ahead with a TTS pipeline, in order to verify that the pipeline is generalisable and gives a benefit over loading a single model + processor.

I was hoping to somehow contribute to the TTS pipeline, but now that you said

They haven't been open-sourcing these latest models, and as a result their open-source checkpoints have fallen by the wayside a bit compared to the latest TTS research (e.g. VALL-E, Bark, MQTTS)

is a TTS pipeline still in the queue, or should I focus on others like https://paperswithcode.com/task/text-to-speech-synthesis ?

jozefchutka commented 1 year ago

Hi @sanchit-gandhi @susnato thanks for the insights. If there are better alternatives please go for it.

sanchit-gandhi commented 1 year ago

IMO the TTS pipeline will be worth pursuing once the two ongoing TTS PRs are complete:

Bark #23375
FastSpeech2 #23439

=> we'd then have three models on which to base the TTS pipeline!

Right now I think these are probably the most worthwhile TTS models to work on in transformers? There's also MQTTS: https://github.com/b04901014/MQTTS But that hasn't gained much traction. Do you know of any other recent TTS models that are gaining popularity amongst the community that we might have missed?

jozefchutka commented 1 year ago

The only other bookmark I have is https://github.com/elevenlabs/elevenlabs-python , but that doesn't seem to be an open model, just an API? It would be worth having someone with a better understanding of the field research this.

sanchit-gandhi commented 1 year ago

As far as I understand, ElevenLabs is only a paid API @jozefchutka, but definitely a performant low-latency model. Interestingly, a new ElevenLabs demo popped up on the HF Hub: https://huggingface.co/spaces/elevenlabs/tts So potentially they're trying to increase their OS presence?

jozefchutka commented 1 year ago

My understanding is the same

susnato commented 1 year ago

Hi @sanchit-gandhi, my knowledge of recent TTS models is very limited, but I have read about some that may be worth adding: Tacotron 2 (an implementation by NVIDIA here), Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling, or TransformerTTS (a PaddlePaddle implementation here and weights here). I also found some unofficial implementations of VALL-E, lifeiteng/vall-e and enhuiz/vall-e, but both are without pretrained weights.

If they are not as interesting then I would like to implement MQTTS. What do you think?

jozefchutka commented 1 year ago

There is one more project I have just discovered, called Piper: https://github.com/rhasspy/piper

MIT license
sounds natural: https://rhasspy.github.io/piper-samples/
ONNX voices available at https://huggingface.co/rhasspy/piper-voices/tree/v1.0.0/en/en_US are reasonably sized

@susnato, @sanchit-gandhi please let me know if you are interested, or does this need a separate issue?

susnato commented 1 year ago

Hi @jozefchutka, we are currently integrating tortoise-tts into HF diffusers. I would be interested in adding this once that integration is done, provided the model is approved by the maintainers.

sanchit-gandhi commented 1 year ago

Thanks for flagging @jozefchutka! Tortoise holds a lot of promise since eventually we'll be able to fine-tune it - I think this means in the long-run we'll be able to build on it more than piper? WDYT?

jozefchutka commented 1 year ago

My idea is to use a TTS model on the web via transformers.js. It seems Piper has reasonably sized voice models (~50MB) and faster-than-realtime performance (probably 10x?).

Real-time factor: 0.03615920479326211 (infer=0.743479167 sec, audio=20.56126984126984 sec)
# 20 sec .wav was generated in 0.7 sec
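
For reference, the real-time factor in the log above is inference time divided by audio duration, so lower is better and its inverse is the speedup over realtime. A minimal sketch of that arithmetic, using only the two numbers from the log (the function names are hypothetical and not part of Piper):

```python
def real_time_factor(infer_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute needed per second of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return infer_seconds / audio_seconds


def speedup_over_realtime(rtf: float) -> float:
    """How many times faster than playback speed the synthesis ran."""
    return 1.0 / rtf


# Values copied from the Piper log line above.
rtf = real_time_factor(infer_seconds=0.743479167, audio_seconds=20.56126984126984)
speedup = speedup_over_realtime(rtf)
print(f"RTF={rtf:.4f}, speedup={speedup:.1f}x")
```

By this measure, that single run was roughly 28x faster than realtime, though the figure will vary with hardware, voice, and text length.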

I cannot find .onnx models for tortoise-tts. Do you have any idea of its size and performance?

flozi00 commented 1 year ago

Tortoise has its name because it's pretty slow, even on GPU ;-)

jozefchutka commented 1 year ago

That's concerning. Do you have any benchmarks to share?

sanchit-gandhi commented 1 year ago

We should be able to speed it up quite a bit in diffusers with torch compile, flash attention, and scheduler choice (similar to the optimisations presented in this blog post: https://huggingface.co/blog/audioldm2)
