susnato opened 1 year ago
btw just leaving this here, in general I think we should add this fork: https://github.com/152334H/tortoise-tts-fast (it is the one which a lot of community members use)
This is also the one recently merged in Coqui: https://github.com/coqui-ai/TTS/pull/2547
@sanchit-gandhi do you think this could make sense?
TortoiseTTS continues to be one of the most popular open-source TTS pipelines due to the high-quality speech samples it can generate. A variant of this model is likely powering ElevenLabs's recent TTS service, which is arguably one of the best in the field (this is probably a fine-tuned / distilled version of TortoiseTTS, but we don't have exact details).
However, the model is natively very slow (the clue is in the name 👀). It could make for a nice addition to `diffusers` by leveraging speed-ups from flash attention, `torch.compile` and half-precision inference.
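As a concrete illustration of those levers (this is generic PyTorch 2.x usage, not code from the Tortoise repo), fused/flash attention is available through a single call, and half precision is an orthogonal switch on top of it:

```python
import torch
import torch.nn.functional as F

# Toy tensors shaped (batch, heads, seq_len, head_dim).
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# PyTorch 2.x routes this through its fused attention kernels and picks a
# flash-attention implementation when the hardware/dtype support it; on CPU
# it falls back to an efficient kernel with identical math.
out = F.scaled_dot_product_attention(q, k, v)

# Half precision is a separate lever: run the same ops under autocast
# (fp16 on CUDA, bfloat16 on CPU) to roughly halve memory traffic.
```

`torch.compile(model)` would then be applied on top of the full module to fuse the surrounding pointwise ops.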
IMO adding this pipeline definitely makes sense if we already have the modelling components ready in `transformers` / `diffusers`. If we have to add a whole new model to `transformers` to support this, it might not be worth the time investment.
Figure 1 of the Tortoise TTS paper gives a nice overview of the pipeline:
@susnato as a first step, could you check which of these components are already available in `transformers` / `diffusers`? The AR model is said to be a vanilla GPT-2 architecture (which we have in `transformers`). The CLVP model is a variant of CLIP - could you verify whether the architecture is the same (and so whether we can leverage CLIP from `transformers`)? The final thing to check is the diffusion model - do they do anything different to a vanilla SD LDM that we have in `diffusers`?
Answering the above three will help gauge what we'll have to do to get this model added to the library.
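For orientation, the pipeline from Figure 1 can be sketched end-to-end as follows. Every callable here is a hypothetical stand-in (the names and stage ordering are mine, read off the figure), not an actual `transformers`/`diffusers` API:

```python
def tortoise_pipeline(text, ar_model, clvp, diffusion_decoder, vocoder, n_candidates=4):
    """Stand-in sketch of the four Tortoise stages; all arguments are placeholders."""
    # 1) The GPT-2-style AR model samples several candidate MEL token sequences.
    candidates = [ar_model(text) for _ in range(n_candidates)]
    # 2) CLVP scores (text, MEL-token) pairs and reranks the candidates.
    best = max(candidates, key=lambda mel_tokens: clvp(text, mel_tokens))
    # 3) The diffusion decoder turns MEL tokens into a MEL spectrogram.
    mel_spec = diffusion_decoder(best)
    # 4) A vocoder converts the spectrogram into a waveform.
    return vocoder(mel_spec)
```

The value of writing it out this way is that each stage maps onto one of the three questions above: stage 1 is GPT-2, stage 2 is the CLIP-like model, stage 3 is the diffusion model, and stage 4 is the vocoder.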
Hi @sanchit-gandhi, I found that:
The CLVP model is slightly different from the CLIP that we have in `transformers`: "CLVP uses an architecture similar to the CLIP text encoder, except it uses two of them: one for text tokens and the other for MEL tokens". Since we already have the CLIP text encoder implemented, it should be easy to implement CLVP.
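To illustrate that dual-tower idea (this is not the actual CLVP code; the dimensions, depths and mean-pooling below are all placeholders), a CLIP-style model with two text-encoder-shaped towers looks roughly like:

```python
import torch
import torch.nn as nn

class MiniCLVP(nn.Module):
    """Two CLIP-text-encoder-style towers: one over text tokens, one over MEL tokens."""
    def __init__(self, text_vocab=256, mel_vocab=8192, dim=64):
        super().__init__()
        def tower():
            layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.text_emb, self.text_enc = nn.Embedding(text_vocab, dim), tower()
        self.mel_emb, self.mel_enc = nn.Embedding(mel_vocab, dim), tower()

    def forward(self, text_ids, mel_ids):
        t = self.text_enc(self.text_emb(text_ids)).mean(dim=1)  # pooled text embedding
        m = self.mel_enc(self.mel_emb(mel_ids)).mean(dim=1)     # pooled MEL embedding
        # CLIP-style similarity logits between the two modalities.
        t = t / t.norm(dim=-1, keepdim=True)
        m = m / m.norm(dim=-1, keepdim=True)
        return t @ m.T

model = MiniCLVP()
logits = model(torch.randint(0, 256, (2, 10)), torch.randint(0, 8192, (2, 20)))
```

The point is that both towers share the CLIP text-encoder shape, so a `# Copied from` port of the existing CLIP text encoder covers most of the work.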
The sampling algorithm they used for the Diffusion Decoder is from the paper Denoising Diffusion Implicit Models; we already have DDIMPipeline in diffusers, so we can use that.
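For reference, the core of what that scheduler implements is the deterministic (eta = 0) DDIM update; a minimal sketch, with variable names of my own choosing rather than diffusers API names:

```python
import math

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic (eta = 0) DDIM update.

    x_t: current noisy sample; eps: the model's noise prediction;
    alpha_bar_*: cumulative products of the noise schedule at the
    current and previous timesteps."""
    # Predict the clean sample implied by the noise estimate.
    x0_pred = (x_t - math.sqrt(1 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)
    # Re-noise it to the previous (less noisy) timestep.
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1 - alpha_bar_prev) * eps
```

Because the update is deterministic, DDIM can skip timesteps, which is exactly the speed lever Tortoise needs.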
For the architecture of the Diffusion Decoder, they say that "The diffusion model uses a bespoke architecture that combines residual convolutions with dense self-attention. It most closely resembles the traditional U-Net model used for DDPMs but without any upsampling or downsampling". I think we can modify the existing U-Net implementation from diffusers to achieve this.
They used the UnivNet vocoder, which I believe we don't have in transformers / diffusers.
So if I am not wrong, most of the parts are already in transformers / diffusers; we mainly need to implement the vocoder and modify some of the existing components.
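For context on that missing piece: UnivNet is a GAN vocoder that maps log-mel spectrogram frames to raw audio at the hop length (256 samples per frame in the paper's setup). Below is only a shape-level stand-in to show the frame-to-sample expansion, not the real architecture (UnivNet uses location-variable convolutions and a noise input, both omitted here):

```python
import torch
import torch.nn as nn

class ToyVocoder(nn.Module):
    """Shape-level stand-in for a GAN vocoder like UnivNet: mel frames -> waveform.
    Upsamples by 8 * 8 * 4 = 256x, matching a hop length of 256 samples."""
    def __init__(self, n_mels=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(n_mels, 64, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(64, 32, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(32, 1, kernel_size=8, stride=4, padding=2),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, mel):              # mel: (batch, n_mels, frames)
        return self.net(mel).squeeze(1)  # (batch, frames * 256)
```

A real port would follow the UnivNet paper's generator/discriminator setup; the only claim here is the interface (mel in, waveform out).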
Hey @susnato,
Excellent! Thanks for getting back with those great findings 🤗
For the diffusion model, I believe we can reuse the U-Net from `diffusers` (it's quite easy to define your upsampling / downsampling ratios, so we'll just pin these all to 1 for no upsampling / downsampling). Would we require changes for the residual convolutions + self-attention?

Hi @sanchit-gandhi, sorry for the late reply, but I found that -
Actually CLVP is relatively closer to CLIP than to CLAP; we (mostly) need to replace the Image Encoder with the Text Encoder in CLIP to get CLVP.
The Diffusion Decoder consists of a stack of DiffusionLayers, each of which is a ResBlock followed by an AttentionBlock; I think we can use AttnDownBlock2D from diffusers, which serves a similar function. Other than that there are some more components in the diffusion decoder, but they are very easy to implement.
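A rough PyTorch sketch of such a layer, just to pin down the structure being discussed (the channel sizes, norms and head counts are placeholders, not the paper's values):

```python
import torch
import torch.nn as nn

class DiffusionLayer(nn.Module):
    """Sketch of the decoder block shape: a residual convolution followed by
    dense self-attention, with no up/downsampling (sequence length preserved)."""
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.res = nn.Sequential(
            nn.GroupNorm(8, channels), nn.SiLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )
        self.norm = nn.GroupNorm(8, channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                    # x: (batch, channels, seq)
        x = x + self.res(x)                  # residual convolution
        h = self.norm(x).transpose(1, 2)     # (batch, seq, channels) for attention
        h, _ = self.attn(h, h, h, need_weights=False)
        return x + h.transpose(1, 2)         # residual self-attention
```

Since input and output shapes match, stacking these layers is trivial, which is what makes the "U-Net without up/downsampling" description plausible to replicate from existing blocks.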
I would be interested in working on this if the maintainers think it is a good addition to the library :). (I don't have much experience with audio models, but I'm interested in learning more about them and working with them.)
Hi @dg845, for this addition we need to add some components prior to the actual pipeline. We can divide those parts among us, but until then let's wait for @sanchit-gandhi's verdict; if he agrees, we can discuss which parts we each want to work on :heart:.
Hey @susnato - sorry for the late reply myself this time! Thanks again for reporting back with such useful findings 🤗
As I see it, the proposed steps for integrating the pipeline look as follows:
1. Add the CLVP model to the `transformers` repo - feel free to open a PR already to start this integration. Regarding whether we add or update, adding a new standalone model with plenty of `# Copied from` statements fits the `transformers` design philosophy best here. Would be nice to document the changes that are required in the opening comment of the PR so that we can all follow the integration!
2. Add the UnivNet vocoder to `transformers` - should be quite fast since this builds on the HiFi GAN vocoder already in `transformers` (again we'll have a new modelling file and use plenty of `# Copied from` statements)
3. Add the pipeline to `diffusers` - putting it all together to get the final model!

Maybe you can get started with the CLVP model already @susnato? And @dg845 with the UnivNet vocoder? Would be cool to work as a team here to get the Tortoise TTS model added as quickly as possible!
I will start working on the CLVP model as soon as possible.
Ditto for the UnivNet model :).
Legends - thanks both! Excited to work with you on the integration 🤗
Hey @dg845 - if you let me know your email (either here or privately by emailing `sanchit<at>huggingface.co`, replacing `<at>` with the required `@` symbol) I can add you to a Slack channel to discuss the integration.
Opened a draft PR for the UnivNet vocoder in `transformers`: https://github.com/huggingface/transformers/pull/24799. Also @sanchit-gandhi, I sent you an email :).
Happy to add the model to `diffusers` - think it'd be a good fit here indeed! It'll be the first time we have auto-regressive transformer inference in `diffusers`, so we'll probably need to iterate a bit on the design, but since it's a very powerful model it'd indeed be great to have it in here :-)
@dg845 let me know if you need any help!
Model/Pipeline/Scheduler description
TorToise is a multi-voice text-to-speech system that applies recent advances from the image generation domain to speech synthesis. It would be great to have this model in diffusers. I would love to contribute this.
Open source status
Paper - https://arxiv.org/pdf/2305.07243.pdf
Github repo - https://github.com/neonbjb/tortoise-tts
@sanchit-gandhi @Vaibhavs10