Fine-tuning voice-cloning capability of metavoice

abhijeethp commented 4 months ago

Hey team, can anyone help me understand the following about the metavoice model fine-tuning process? https://github.com/metavoiceio/metavoice-src/tree/main?tab=readme-ov-file#finetuning

For fine-tuning the model, what are the minimum and maximum audio lengths that the system allows?
The fine-tuning script takes only 2 files as input: a speech (audio) file and its transcription. How is this possible? Is the SiSNR calculated against the same audio?
I want to fine-tune the voice cloning aspect of metavoice if possible. Is there anything extra I need to implement to do this?

Arman12345677 commented 4 months ago

Old man voice

lucapericlp commented 3 months ago

Hey @abhijeethp, sorry for only getting to this now. We've seen people finetune using chunks of 5-10s audio in their training datasets (but it's not a hard range). We're not calculating SiSNR as part of finetuning. Are you asking whether using the same audio is appropriate?

Re finetuning the voice cloning: you should be all good if you follow the finetuning guide with a solid dataset, play around with the hyperparameters, and then use a good reference clip at inference.
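
For anyone preparing a dataset along these lines, here is a minimal sketch of how a longer recording could be cut into chunks in roughly the 5-10s range mentioned above. It is an illustration only and not part of metavoice-src: the function name `chunk_bounds` and the `max_len` parameter are hypothetical, and it only computes time boundaries. Slicing the audio itself and producing a matching transcription for each chunk is left to whatever tooling you already use.

```python
import math

def chunk_bounds(duration_s, max_len=10.0):
    """Return (start, end) times in seconds that split a clip into equal chunks.

    Hypothetical helper, not from metavoice-src. Clips no longer than
    max_len are kept whole. Longer clips are split into the fewest equal
    parts that fit under max_len, so each part is longer than max_len / 2
    (i.e. between 5s and 10s with the default).
    """
    if duration_s <= 0:
        return []
    if duration_s <= max_len:
        return [(0.0, float(duration_s))]
    n = math.ceil(duration_s / max_len)  # fewest chunks that fit under max_len
    step = duration_s / n
    bounds = [(i * step, (i + 1) * step) for i in range(n)]
    # Pin the final end time to the clip length to avoid float drift.
    bounds[-1] = (bounds[-1][0], float(duration_s))
    return bounds

if __name__ == "__main__":
    for start, end in chunk_bounds(25.0):
        print(f"{start:.2f}s -> {end:.2f}s")
```

Since 5-10s is not a hard range, a short clip (say 3s) is simply returned as a single chunk rather than being discarded; whether to keep such clips is a dataset decision.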

metavoiceio / metavoice-src

Fine-tuning voice-cloning capability of metavoice #137