How to generate speech conditioned on an audio clip in addition to transcripts and descriptions?

Hi, thanks for open-sourcing the code. I want to generate speech conditioned on transcripts, descriptions, and audio clips using the pre-trained audioldm-gigaspeech model. However, the provided example only accepts transcripts and descriptions. Could you also release an example that takes an audio clip as a condition in addition to the transcript and description? Or do you have any tips on how to modify the code to run speech generation conditioned on all three?

Thanks in advance.

haoheliu / AudioLDM2

How to generate speech conditioned on an audio clip in addition to transcripts and descriptions? #54