haoheliu / AudioLDM-training-finetuning

AudioLDM training, finetuning, evaluation and inference.
https://audioldm.github.io/audioldm2/
MIT License

A question about training with transcription #19

Closed wangjs9 closed 10 months ago

wangjs9 commented 10 months ago

I appreciate your wonderful work and the open-sourced code; it has helped me a lot! I have a question: does the codebase provide a way to fine-tune the model with transcriptions? If not, I'd like to implement that part myself, but I am not sure whether I simply missed it while reading the code.

Looking forward to your reply!

haoheliu commented 10 months ago

@wangjs9 Do you mean finetuning the model to perform text-to-speech tasks?

Tortoise17 commented 10 months ago

@haoheliu Yes, I am also interested in that, but I want to do transfer learning of a person's vocal style in speech. Could you guide me?

wangjs9 commented 10 months ago

Thank you for your reply! Yes, I'd like to fine-tune a model that performs the text-to-speech task well on my own dataset, and then run it from the command line like:

audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"

I have compared the code here with the inference repo (AudioLDM2) and did not find the related code.

I can implement it myself given your very clear code and instructions, but I wanted to ask you first to confirm.

haoheliu commented 10 months ago

@Tortoise17 @wangjs9 The code for AudioLDM2 is already in this repo, but the config has not been pushed and tested yet. You are most welcome to open a PR or contribute in other ways.

I'm not sure about the transfer learning. If you'd like to copy a voice from a vocal prompt, you can do in-context learning (or continuation) with the autoregressive part of AudioLDM 2.

I suppose the text-to-speech task would not be painful to implement. I recommend starting with this conditioning module: https://github.com/haoheliu/AudioLDM-training-finetuning/blob/main/audioldm_train/conditional_models.py#L482, which performs AudioMAE-feature prediction; the predicted output is then used to condition the diffusion model. The AudioLDM2 repo uses a similar (in fact, the same) conditioning module, so you can refer to the configuration there when writing the training configuration YAML file.
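For readers following this thread, here is a minimal, self-contained PyTorch sketch of the conditioning idea described above: a small sequence model predicts AudioMAE-style feature tokens from text embeddings, and those predicted tokens serve as cross-attention conditioning for the diffusion model. All class and variable names below are hypothetical illustrations, not the repo's actual API; see conditional_models.py for the real implementation.

```python
# Hypothetical sketch of the AudioLDM2-style conditioning chain:
# text embeddings -> sequence model -> predicted AudioMAE-like tokens
# -> cross-attention context for the latent diffusion model.
# These class/variable names do not come from the repo; they only
# illustrate the data flow described in the comment above.

import torch
import torch.nn as nn


class AudioMAETokenPredictor(nn.Module):
    """Predicts a sequence of AudioMAE-style feature tokens from text features."""

    def __init__(self, text_dim=1024, token_dim=768, n_tokens=8):
        super().__init__()
        self.n_tokens = n_tokens
        self.proj_in = nn.Linear(text_dim, token_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=8, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Learned query tokens that get filled in with predicted AudioMAE features.
        self.queries = nn.Parameter(torch.randn(1, n_tokens, token_dim))

    def forward(self, text_emb):  # text_emb: (batch, seq_len, text_dim)
        batch = text_emb.shape[0]
        queries = self.queries.expand(batch, -1, -1)
        x = torch.cat([queries, self.proj_in(text_emb)], dim=1)
        x = self.backbone(x)
        return x[:, : self.n_tokens]  # (batch, n_tokens, token_dim)


# Training-time use: regress the predicted tokens onto ground-truth AudioMAE
# features extracted from the target audio, then pass them to the diffusion
# model as its cross-attention conditioning tensor.
predictor = AudioMAETokenPredictor()
text_emb = torch.randn(2, 20, 1024)   # stand-in for a text encoder's output
target_mae = torch.randn(2, 8, 768)   # stand-in for ground-truth AudioMAE features
pred = predictor(text_emb)
loss = nn.functional.mse_loss(pred, target_mae)
loss.backward()
```

Note that the actual AudioLDM 2 system uses a GPT-2-style autoregressive model for this feature prediction; the sketch above only captures the overall conditioning data flow, not that architecture.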

wangjs9 commented 10 months ago

Thank you for your reply and your detailed suggestions! I am confident the implementation will be smooth given your clear code. I asked because I was worried about doing redundant work; now I think the main modification is the configuration.

Thank you again! Looking forward to more wonderful work from you and your lab!

Tortoise17 commented 10 months ago

@haoheliu This is a great hint. So it means I can use my own vocal recording as a prompt and generate speech with the autoregressive part. I would like to do some research on this, as I find this work outstanding. One more question: can we train or fine-tune the text-to-speech model for our own use case, and for languages other than English? So far English works fine, but it still needs refinement, which I want to work on.

Tortoise17 commented 10 months ago

@haoheliu Is there any flag I can use for in-context learning? How do I supply my input.wav file to generate a specific output with the speech generation model or the audioldm2-full model?