jozefchutka opened 1 year ago
Hi @jozefchutka, I would like to work on this issue. I see multiple models under Implemented Models on your link; do you have any recommendation about which one to start with?
Hi @susnato, thanks for looking into this. I hope to eventually run TTS in the browser via transformers.js, so my recommendation would be to pick a model that is suitable in terms of performance/size.
Hi @jozefchutka, thanks for replying. I was thinking about Speedy-Speech, but I didn't see that model inside TTS/tts/models in the dev branch. Am I looking in the wrong branch?
Honestly, I have no idea. But I have just discovered that GitHub provides a very nice code browsing view, including search.
If it's nowhere to be found, it would be worth reaching out to the 🐸TTS team.
cc @sanchit-gandhi
Hey @jozefchutka and @susnato - Coqui were previously focused on providing strong open-source TTS checkpoints, however in the last year they pivoted to more end-user services (see https://twitter.com/coqui_ai/status/1638573847296499712). They haven't been open-sourcing these latest models, and as a result their open-source checkpoints have fallen by the wayside a bit compared to the latest TTS research (e.g. VALL-E, Bark, MQTTS). I would say that a year ago it would have been a very exciting addition, but now there are more performant checkpoints that are growing in popularity amongst the open-source community. I would recommend checking out the aforementioned models if you're interested in a TTS model integration! Also see related https://github.com/huggingface/transformers/issues/22487#issuecomment-1496340245
Hi @sanchit-gandhi thanks for replying! Actually I was going through the same issue and saw your comment -
Indeed, a TTS pipeline would be super helpful to run SpeechT5. We're currently planning on waiting till we have 1-2 more TTS models in the library before pushing ahead with a TTS pipeline, in order to verify that the pipeline is generalisable and gives a benefit over loading a single model + processor.
I was hoping to somehow contribute to the TTS pipeline, but now that you said
They haven't been open-sourcing these latest models, and as a result their open-source checkpoints have fallen by the wayside a bit compared to the latest TTS research (e.g. VALL-E, Bark, MQTTS)
is a TTS pipeline still in the queue, or should I focus on other models, like those at https://paperswithcode.com/task/text-to-speech-synthesis ?
Hi @sanchit-gandhi @susnato, thanks for the insights. If there are better alternatives, please go for those.
IMO the TTS pipeline will be worth pursuing once the two ongoing TTS PRs are complete:
=> we'd then have three models on which to base the TTS pipeline!
Right now I think these are probably the most worthwhile TTS models to work on in transformers? There's also MQTTS: https://github.com/b04901014/MQTTS But that hasn't gained much traction. Do you know of any other recent TTS models that are gaining popularity amongst the community that we might have missed?
The only other bookmark I have is https://github.com/elevenlabs/elevenlabs-python , but that doesn't seem to be an open model, just an API? Worth researching for someone with a better understanding of the field.
As far as I understand, ElevenLabs is only a paid API @jozefchutka, but definitely a performant low-latency model. Interestingly, a new ElevenLabs demo popped up on the HF Hub: https://huggingface.co/spaces/elevenlabs/tts So potentially they're trying to increase their OS presence?
My understanding is the same
Hi @sanchit-gandhi, my knowledge of recent TTS models is very limited, but I read about some that may be worth adding: how about Tacotron 2 (an implementation by NVIDIA here) or Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling? I also found some unofficial implementations of VALL-E (lifeiteng/vall-e and enhuiz/vall-e, but both without pretrained weights), and TransformerTTS (a PaddlePaddle implementation here, with weights here).
If those are not as interesting, then I would like to implement MQTTS. What do you think?
There is one more project I have just discovered, called Piper: https://github.com/rhasspy/piper
@susnato , @sanchit-gandhi please let me know if you are interested, or should a separate issue be opened for this?
Hi @jozefchutka, we are currently integrating tortoise-tts into HF diffusers.
I would be interested in adding this after that integration is over, and also if this model is approved by the maintainers.
Thanks for flagging @jozefchutka! Tortoise holds a lot of promise since eventually we'll be able to fine-tune it - I think this means in the long-run we'll be able to build on it more than piper? WDYT?
My idea is to use a TTS model on the web via transformers.js. It seems Piper has reasonably sized voice models (~50MB) and faster-than-realtime performance (probably 10x?).
Real-time factor: 0.03615920479326211 (infer=0.743479167 sec, audio=20.56126984126984 sec)
# 20 sec .wav was generated in 0.7 sec
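The real-time factor in that log is just inference time divided by the duration of the generated audio. A minimal sketch in plain Python, using the numbers from the Piper log above (function name is mine, for illustration):

```python
def real_time_factor(infer_seconds: float, audio_seconds: float) -> float:
    """Real-time factor (RTF): inference time divided by audio duration.

    RTF < 1 means faster than realtime; 1/RTF gives the speedup factor.
    """
    return infer_seconds / audio_seconds

# Numbers taken from the Piper log above
rtf = real_time_factor(0.743479167, 20.56126984126984)
print(round(rtf, 4))   # ~0.0362
print(round(1 / rtf))  # ~28x faster than realtime
```

So Piper's measured speedup here is closer to ~28x than the 10x I guessed.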
I cannot find .onnx models for tortoise-tts. Do you have any idea of the size and performance?
Tortoise has its name because it's pretty slow, even on GPU ;-)
That's concerning. Do you have any benchmarks to share?
We should be able to speed it up quite a bit in diffusers with torch.compile, flash attention, and scheduler choice (similar to the optimisations presented in this blog post: https://huggingface.co/blog/audioldm2).
Model description
🐸TTS is a library for advanced Text-to-Speech generation. It's built on the latest research and designed to achieve the best trade-off among ease of training, speed, and quality. 🐸TTS comes with pretrained models and tools for measuring dataset quality, and is already used in 20+ languages for products and research projects.
Open source status
Provide useful links for the implementation
GitHub repo: https://github.com/coqui-ai/TTS
Samples: http://erogol.com/ddc-samples/