susnato opened 1 year ago
btw just leaving this here, in general I think we should add this fork: https://github.com/152334H/tortoise-tts-fast (it is the one which a lot of community members use)
This is also the one recently merged in Coqui: https://github.com/coqui-ai/TTS/pull/2547
@sanchit-gandhi do you think this could make sense?
TortoiseTTS continues to be one of the most popular open-source TTS pipelines due to the high-quality speech samples it can generate. A variant of this model is likely powering ElevenLabs's recent TTS service, which is arguably one of the best in the field (this is probably a fine-tuned / distilled version of TortoiseTTS, but we don't have exact details).
However, the model is natively very slow (the clue is in the name 👀). It could make for a nice addition to `diffusers` by leveraging speed-ups from flash attention, `torch.compile` and half-precision inference.
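As a concrete illustration of those levers (this is generic PyTorch 2.x usage, not code from the Tortoise repo), fused/flash attention is available through a single call, and half precision is an orthogonal switch on top of it:

```python
import torch
import torch.nn.functional as F

# Toy tensors shaped (batch, heads, seq_len, head_dim).
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# PyTorch 2.x routes this through its fused attention kernels and picks a
# flash-attention implementation when the hardware/dtype support it; on CPU
# it falls back to an efficient kernel with identical math.
out = F.scaled_dot_product_attention(q, k, v)

# Half precision is a separate lever: run the same ops under autocast
# (fp16 on CUDA, bfloat16 on CPU) to roughly halve memory traffic.
```

`torch.compile(model)` would then be applied on top of the full module to fuse the surrounding pointwise ops.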
IMO adding this pipeline definitely makes sense if we already have the modelling components ready in `transformers` / `diffusers`. If we have to add a whole new model to `transformers` to support this, it might not be worth the time investment.
Figure 1 of the Tortoise TTS paper gives a nice overview of the pipeline:
@susnato as a first step, could you check which of these components are already available in `transformers` / `diffusers`? The AR model is said to be a vanilla GPT-2 architecture (which we have in `transformers`). The CLVP model is a variant of CLIP - could you verify whether the architecture is the same (and so whether we can leverage CLIP from `transformers`)? The final thing to check is the diffusion model - do they do anything different to a vanilla SD LDM that we have in `diffusers`?
Answering the above three will help gauge what we'll have to do to get this model added to the library.
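For orientation, the pipeline from Figure 1 can be sketched end-to-end as follows. Every callable here is a hypothetical stand-in (the names and stage ordering are mine, read off the figure), not an actual `transformers`/`diffusers` API:

```python
def tortoise_pipeline(text, ar_model, clvp, diffusion_decoder, vocoder, n_candidates=4):
    """Stand-in sketch of the four Tortoise stages; all arguments are placeholders."""
    # 1) The GPT-2-style AR model samples several candidate MEL token sequences.
    candidates = [ar_model(text) for _ in range(n_candidates)]
    # 2) CLVP scores (text, MEL-token) pairs and reranks the candidates.
    best = max(candidates, key=lambda mel_tokens: clvp(text, mel_tokens))
    # 3) The diffusion decoder turns MEL tokens into a MEL spectrogram.
    mel_spec = diffusion_decoder(best)
    # 4) A vocoder converts the spectrogram into a waveform.
    return vocoder(mel_spec)
```

The value of writing it out this way is that each stage maps onto one of the three questions above: stage 1 is GPT-2, stage 2 is the CLIP-like model, stage 3 is the diffusion model, and stage 4 is the vocoder.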
Hi @sanchit-gandhi, I found that:
The CLVP model is slightly different from the CLIP that we have in `transformers`: "CLVP uses an architecture similar to the CLIP text encoder, except it uses two of them: one for text tokens and the other for MEL tokens". Since we already have the CLIP text encoder implemented, it should be easy to implement CLVP.
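To illustrate that dual-tower idea (this is not the actual CLVP code; the dimensions, depths and mean-pooling below are all placeholders), a CLIP-style model with two text-encoder-shaped towers looks roughly like:

```python
import torch
import torch.nn as nn

class MiniCLVP(nn.Module):
    """Two CLIP-text-encoder-style towers: one over text tokens, one over MEL tokens."""
    def __init__(self, text_vocab=256, mel_vocab=8192, dim=64):
        super().__init__()
        def tower():
            layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.text_emb, self.text_enc = nn.Embedding(text_vocab, dim), tower()
        self.mel_emb, self.mel_enc = nn.Embedding(mel_vocab, dim), tower()

    def forward(self, text_ids, mel_ids):
        t = self.text_enc(self.text_emb(text_ids)).mean(dim=1)  # pooled text embedding
        m = self.mel_enc(self.mel_emb(mel_ids)).mean(dim=1)     # pooled MEL embedding
        # CLIP-style similarity logits between the two modalities.
        t = t / t.norm(dim=-1, keepdim=True)
        m = m / m.norm(dim=-1, keepdim=True)
        return t @ m.T

model = MiniCLVP()
logits = model(torch.randint(0, 256, (2, 10)), torch.randint(0, 8192, (2, 20)))
```

The point is that both towers share the CLIP text-encoder shape, so a `# Copied from` port of the existing CLIP text encoder covers most of the work.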
The sampling algorithm they used for the Diffusion Decoder is from the paper Denoising Diffusion Implicit Models; we already have DDIMPipeline in diffusers, so we can use that.
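For reference, the core of what that scheduler implements is the deterministic (eta = 0) DDIM update; a minimal sketch, with variable names of my own choosing rather than diffusers API names:

```python
import math

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic (eta = 0) DDIM update.

    x_t: current noisy sample; eps: the model's noise prediction;
    alpha_bar_*: cumulative products of the noise schedule at the
    current and previous timesteps."""
    # Predict the clean sample implied by the noise estimate.
    x0_pred = (x_t - math.sqrt(1 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)
    # Re-noise it to the previous (less noisy) timestep.
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1 - alpha_bar_prev) * eps
```

Because the update is deterministic, DDIM can skip timesteps, which is exactly the speed lever Tortoise needs.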
For the architecture of the Diffusion Decoder, they say that "The diffusion model uses a bespoke architecture that combines residual convolutions with dense self-attention. It most closely resembles the traditional U-Net model used for DDPMs but without any upsampling or downsampling". I think we can modify the existing U-Net implementation from diffusers to achieve this.
They used the UnivNet vocoder, which I believe we don't have in transformers / diffusers.
So if I am not wrong, most of the parts are already in transformers / diffusers; we mainly need to implement the vocoder and modify some of the existing components.
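For context on that missing piece: UnivNet is a GAN vocoder that maps log-mel spectrogram frames to raw audio at the hop length (256 samples per frame in the paper's setup). Below is only a shape-level stand-in to show the frame-to-sample expansion, not the real architecture (UnivNet uses location-variable convolutions and a noise input, both omitted here):

```python
import torch
import torch.nn as nn

class ToyVocoder(nn.Module):
    """Shape-level stand-in for a GAN vocoder like UnivNet: mel frames -> waveform.
    Upsamples by 8 * 8 * 4 = 256x, matching a hop length of 256 samples."""
    def __init__(self, n_mels=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(n_mels, 64, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(64, 32, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(32, 1, kernel_size=8, stride=4, padding=2),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, mel):              # mel: (batch, n_mels, frames)
        return self.net(mel).squeeze(1)  # (batch, frames * 256)
```

A real port would follow the UnivNet paper's generator/discriminator setup; the only claim here is the interface (mel in, waveform out).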
Hey @susnato,
Excellent! Thanks for getting back with those great findings 🤗
For the diffusion model, I believe we can reuse the U-Net from `diffusers` (it's quite easy to define your upsampling / downsampling ratios, so we'll just pin these all to 1 for no upsampling / downsampling). Would we require changes for the residual convolutions + self-attention?

Hi @sanchit-gandhi, sorry for the late reply, but I found that -
Actually CLVP is relatively closer to CLIP than to CLAP; we (mostly) need to replace the Image Encoder with the Text Encoder in CLIP to get CLVP.
The Diffusion Decoder consists of a stack of DiffusionLayers, each of which is a ResBlock followed by an AttentionBlock; I think we can use AttnDownBlock2D from diffusers, which serves a similar function. Other than that there are some more components in the diffusion decoder, but they are very easy to implement.
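A rough PyTorch sketch of such a layer, just to pin down the structure being discussed (the channel sizes, norms and head counts are placeholders, not the paper's values):

```python
import torch
import torch.nn as nn

class DiffusionLayer(nn.Module):
    """Sketch of the decoder block shape: a residual convolution followed by
    dense self-attention, with no up/downsampling (sequence length preserved)."""
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.res = nn.Sequential(
            nn.GroupNorm(8, channels), nn.SiLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )
        self.norm = nn.GroupNorm(8, channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                    # x: (batch, channels, seq)
        x = x + self.res(x)                  # residual convolution
        h = self.norm(x).transpose(1, 2)     # (batch, seq, channels) for attention
        h, _ = self.attn(h, h, h, need_weights=False)
        return x + h.transpose(1, 2)         # residual self-attention
```

Since input and output shapes match, stacking these layers is trivial, which is what makes the "U-Net without up/downsampling" description plausible to replicate from existing blocks.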
I would be interested in working on this if the maintainers think it is a good addition to the library :). (I don't have much experience with audio models, but I'm interested in learning more about them and working with them.)
Hi @dg845, for this addition we need to add some components prior to the actual pipeline. We can divide those parts among us, but until then let's wait for @sanchit-gandhi's verdict; if he agrees, we can discuss which parts we each want to work on :heart:.
Hey @susnato - sorry for the late reply myself this time! Thanks again for reporting back with such useful findings 🤗
As I see it, the proposed steps for integrating the pipeline look as follows:
1. Add the CLVP model to the `transformers` repo - feel free to open a PR already to start this integration. Regarding whether we add or update, adding a new standalone model with plenty of `# Copied from` statements fits the `transformers` design philosophy best here. Would be nice to document the changes that are required in the opening comment of the PR so that we can all follow the integration!
2. Add the UnivNet vocoder to `transformers` - should be quite fast since this builds on the HiFi GAN vocoder already in `transformers` (again we'll have a new modelling file and use plenty of `# Copied from` statements)
3. Add the pipeline to `diffusers` - putting it all together to get the final model!

Maybe you can get started with the CLVP model already @susnato? And @dg845 with the UnivNet vocoder? Would be cool to work as a team here to get the Tortoise TTS model added as quickly as possible!
I will start working on the CLVP model as soon as possible.
Ditto for the UnivNet model :).
Legends - thanks both! Excited to work with you on the integration 🤗
Hey @dg845 - if you let me know your email (either here or privately by emailing `sanchit<at>huggingface.co`, replacing `<at>` with the required `@` symbol) I can add you to a Slack channel to discuss the integration.
Opened a draft PR for the UnivNet vocoder in `transformers`: https://github.com/huggingface/transformers/pull/24799. Also @sanchit-gandhi, I sent you an email :).
Happy to add the model to `diffusers` - think it'd be a good fit here indeed! It'll be the first time we have auto-regressive transformer inference in `diffusers`, so we'll probably need to iterate a bit on the design, but since it's a very powerful model it'd indeed be great to have it in here :-)
@dg845 let me know if you need any help!
Model/Pipeline/Scheduler description
TorToise is a multi-voice text-to-speech system that applies recent advances from the image generation domain to speech synthesis. It would be great to have this model in diffusers. I would love to contribute this.
Open source status
Paper - https://arxiv.org/pdf/2305.07243.pdf
Github repo - https://github.com/neonbjb/tortoise-tts
@sanchit-gandhi @Vaibhavs10