jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

How to finetune the given pre-trained model? #105

Open

apzl commented 1 year ago

I was looking to fine-tune the VITS model on some custom data, but noticed that the released model contains only the generator, while the discriminator is also required to continue training from the checkpoint. Is the discriminator model available somewhere, or is there any other way to fine-tune the available model?

nikich340 commented 1 year ago

No discriminator model, so no way for fine-tuning.

iamkhalidbashir commented 1 year ago

> No discriminator model, so no way for fine-tuning.

The pre-trained model in this repo might not have a discriminator, but the one at coqui-tts does. And I get good results when fine-tuning that VITS model to a new voice with only a few steps.
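A rough sketch of such a fine-tuning run, following the shape of Coqui TTS's published LJSpeech VITS recipe (the dataset path, metadata file, and restore checkpoint below are placeholders, and the exact config/dataclass fields can differ across Coqui versions, so check against your install):

```python
# Sketch: fine-tune Coqui TTS's VITS on a new voice by restoring a full
# training checkpoint, which bundles the discriminator along with the
# generator. Paths and dataset details are placeholders.
import os

from trainer import Trainer, TrainerArgs
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "vits_finetune"
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",            # expects metadata.csv in LJSpeech layout
    meta_file_train="metadata.csv",
    path="my_voice_dataset/")

config = VitsConfig(
    batch_size=16,
    eval_batch_size=8,
    epochs=100,
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    output_path=output_path,
    datasets=[dataset_config])

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = Vits(config, ap, tokenizer, speaker_manager=None)

# restore_path must point at a full Coqui *training* checkpoint (.pth) so the
# discriminator weights are restored, not at a generator-only release file.
trainer = Trainer(
    TrainerArgs(restore_path="vits_ljspeech_checkpoint.pth"),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples)
trainer.fit()
```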

skyler14 commented 1 year ago

> No discriminator model, so no way for fine-tuning.

> The pre-trained model in this repo might not have a discriminator, but the one at coqui-tts does. And I get good results when fine-tuning that VITS model to a new voice with only a few steps.

where is the coqui discriminator?

CookiePPP commented 1 year ago

You can also use the pretrained HiFi-GAN discriminator; VITS reuses HiFi-GAN's discriminator code: https://github.com/jaywalnut310/vits/blob/2e561ba58618d021b5b8323d3765880f7e0ecfdb/models.py#L364
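A rough sketch of what that wiring could look like, assuming a HiFi-GAN training checkpoint (a `do_*` file, which stores the multi-period discriminator under the `mpd` key). VITS places a DiscriminatorS at index 0 of its `discriminators` list, so HiFi-GAN's period discriminators shift up by one, and any tensors whose shapes don't match are simply skipped:

```python
# Sketch: seed this repo's MultiPeriodDiscriminator from a pretrained HiFi-GAN
# training checkpoint. The "do_*" filename and the 'mpd' key follow HiFi-GAN's
# train.py; the path below is a placeholder.
import torch

from models import MultiPeriodDiscriminator  # this repo's models.py

net_d = MultiPeriodDiscriminator()
hifigan = torch.load("do_02500000", map_location="cpu")

# VITS's discriminators[0] is a DiscriminatorS; its DiscriminatorP modules
# start at index 1, so HiFi-GAN's mpd index i maps to VITS index i + 1.
remapped = {}
for key, tensor in hifigan["mpd"].items():
    prefix, idx, rest = key.split(".", 2)  # e.g. "discriminators.0.convs.0.weight_v"
    remapped[f"{prefix}.{int(idx) + 1}.{rest}"] = tensor

# Keep only shape-compatible tensors (the scale discriminator differs between
# the two repos) and leave everything else at its random initialization.
own = net_d.state_dict()
compatible = {k: v for k, v in remapped.items()
              if k in own and v.shape == own[k].shape}
missing, unexpected = net_d.load_state_dict(compatible, strict=False)
print(f"loaded {len(compatible)} tensors; {len(missing)} stay randomly initialized")
```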

helgadnestr commented 1 year ago

> The pre-trained model in this repo might not have a discriminator, but the one at coqui-tts does. And I get good results when fine-tuning that VITS model to a new voice with only a few steps.

@iamkhalidbashir Did you fine-tune the pretrained model on VCTK (or LJSpeech?) with an additional speaker? And about coqui-tts, can you tell me where the discriminator is? The downloaded model looks like a generator only.

nivibilla commented 1 year ago

Hi everyone, following the suggestion from @CookiePPP, I have a fork with a running fine-tuning loop (LJSpeech only at the moment). I am using the LJSpeech generator from this repo; I extracted the discriminator from HiFi-GAN and amended the code slightly to make it work.

https://github.com/nivibilla/efficient-vits-finetuning
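For anyone following along, the loading step in that setup looks roughly like the snippet below, mirroring this repo's train.py and inference.ipynb (the checkpoint path is a placeholder):

```python
# Sketch: restore the released LJSpeech generator with this repo's utilities,
# seed the discriminator separately (see the HiFi-GAN remapping sketch above),
# then continue with train.py's training loop.
import utils
from models import SynthesizerTrn, MultiPeriodDiscriminator
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/ljs_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model)
net_d = MultiPeriodDiscriminator(hps.model.use_spectral_norm)

# The released checkpoint holds generator weights only.
utils.load_checkpoint("pretrained_ljs.pth", net_g, None)
# net_d keeps its HiFi-GAN-seeded (or random) weights; a reduced learning
# rate for the first epochs gives the discriminator time to catch up.
```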

I would love to get help with fine-tuning, as I'm currently limited to just using Colab lol.