Open filip-michalsky opened 1 year ago
Cc: @sanchit-gandhi
Thanks for opening this issue @filip-michalsky! The way the AudioLDM2 pipeline is constructed, the only new bit is converting the VITS TTS encoder to HF Transformers' format. The rest should be compatible with the existing AudioLDM2 pipeline. Will take a look into this!
BTW I don't think this is super high priority @sanchit-gandhi - we don't see a lot of usage for AudioLDM2 (yet)
Agree - maybe we open this one up to the community for now?
Hi @sanchit-gandhi, I don't have much experience with speech and audio but I am happy to learn. Can I contribute to this issue? Thanks
Is your feature request related to a problem? Please describe.
The current AudioLDM 2 pipeline does not accept a "transcript" field. As a result, it does not create phonemes and cannot be used for text-to-speech generation.
https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline
Currently, only text-to-music and text-to-audio are supported. The latent diffusion model is not guided to create phonemes, as it is in the original implementation with these two checkpoints:
Here: https://github.com/haoheliu/AudioLDM2/blob/main/audioldm2/pipeline.py#L78
and here: https://github.com/haoheliu/AudioLDM2/blob/main/audioldm2/latent_diffusion/models/ddpm.py#L482
Command line from the original repo:
audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"
These two checkpoints natively consume "phoneme" as one of the fields in the batch.
Describe the solution you'd like
Add a "transcription" input parameter that allows choosing a TTS model from the two checkpoints above, and hence enables the TTS task.
Describe alternatives you've considered
The original repo implementation, which is very slow and unoptimized.
Additional context
I believe the existing AudioLDM2 pipeline could be updated to take in the transcript field, update the batch, and load the two additional checkpoints trained on the TTS task. However, I currently don't know enough to assess which parts of the pipeline would need to change relative to the original implementation in https://github.com/haoheliu/AudioLDM2/blob/main/audioldm2/latent_diffusion/models/ddpm.py#L482
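To make the "update the batch" step concrete, here is a rough sketch of what I have in mind. It is plain Python and does not use diffusers or the original repo; `build_batch`, `placeholder_phonemes`, and the exact key names are hypothetical, and the character split is only a stand-in for a real grapheme-to-phoneme step. The idea is simply that the "phoneme" field is added only when a transcription is passed, so text-to-audio and text-to-music calls stay unchanged.

```python
from typing import Callable, Dict, List, Optional


def placeholder_phonemes(transcript: str) -> List[str]:
    """Stand-in only: lowercases and splits into characters.

    A real pipeline would use a grapheme-to-phoneme model here.
    """
    return [ch for ch in transcript.lower() if not ch.isspace()]


def build_batch(
    prompt: str,
    transcription: Optional[str] = None,
    batch_size: int = 1,
    to_phonemes: Callable[[str], List[str]] = placeholder_phonemes,
) -> Dict[str, list]:
    """Build a conditioning batch; add "phoneme" only for TTS requests."""
    batch: Dict[str, list] = {"text": [prompt] * batch_size}
    if transcription:
        # Hypothetical key name, mirroring the "phoneme" batch field
        # that the TTS checkpoints are described as consuming.
        batch["phoneme"] = [to_phonemes(transcription)] * batch_size
    return batch


# Text-to-audio request: no transcription, so no "phoneme" field.
audio_batch = build_batch("Techno music with a strong beat")

# TTS request: same prompt style plus the words to be spoken.
tts_batch = build_batch(
    "A female reporter is speaking full of emotion",
    transcription="Wish you have a good day",
)

print(sorted(audio_batch))  # ['text']
print(sorted(tts_batch))    # ['phoneme', 'text']
```

Whether the real pipeline should branch on the presence of a transcription like this, or select the TTS checkpoint explicitly, is an open question on my side.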