huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Feature request: Update the pipeline for AudioLDM 2 so that 'transcript' can be consumed and text to speech created #4923

Open filip-michalsky opened 1 year ago

filip-michalsky commented 1 year ago

Is your feature request related to a problem? Please describe.

The current pipeline for AudioLDM 2 does not accept a "transcript" field. As a result, it does not create phonemes and therefore does not support text-to-speech generation.

https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline

Currently, only text-to-music and text-to-audio are supported. The latent diffusion model is not guided by phonemes as in the original implementation, which uses these two checkpoints:

Here: https://github.com/haoheliu/AudioLDM2/blob/main/audioldm2/pipeline.py#L78

and here: https://github.com/haoheliu/AudioLDM2/blob/main/audioldm2/latent_diffusion/models/ddpm.py#L482

Command line from the original repo: `audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"`

These two checkpoints natively consume "phoneme" as one of the fields in the batch.

Describe the solution you'd like

Add a "transcription" input parameter and allow choosing a TTS model from the two checkpoints above, enabling the TTS task.

Describe alternatives you've considered

The original repo implementation, which is very slow and unoptimized.

Additional context

I believe the already-implemented AudioLDM2 pipeline could be updated to take in the transcript field, update the batch, and load the two additional checkpoints trained on the TTS task. However, I currently don't have enough knowledge to assess which parts of the pipeline need to be updated relative to the original implementation in https://github.com/haoheliu/AudioLDM2/blob/main/audioldm2/latent_diffusion/models/ddpm.py#L482

sayakpaul commented 1 year ago

Cc: @sanchit-gandhi

sanchit-gandhi commented 1 year ago

Thanks for opening this issue @filip-michalsky! The way the AudioLDM2 pipeline is constructed, the only new bit is converting the VITS TTS encoder to HF Transformers' format. The rest should be compatible with the existing AudioLDM2 pipeline. Will take a look into this!

patrickvonplaten commented 1 year ago

BTW I don't think this is super high priority @sanchit-gandhi - we don't see a lot of usage for AudioLDM2 (yet)

sanchit-gandhi commented 1 year ago

Agree - maybe we open this one up to the community for now?

Bhavay-2001 commented 7 months ago

Hi @sanchit-gandhi, I don't have much experience with speech and audio but I am happy to learn. Can I contribute to this issue? Thanks