jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

Training for custom dataset #134

Open huydang2106 opened 1 year ago

huydang2106 commented 1 year ago

Has anyone tried VITS on a dataset in another language? Did it produce natural, high-quality sound? Any detailed instructions for training on a custom dataset? Thank you.

nivibilla commented 1 year ago

You can use my fork; it's a work in progress. I haven't tuned any models yet, but the training loop works.

https://github.com/nivibilla/efficient-vits-finetuning

huydang2106 commented 1 year ago

Thank you. I can do the training with ESPnet, but the output quality is not as good as expected, so I'm looking for tricks or advice on properly fine-tuning on a dataset in another language.
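For anyone starting from an LJSpeech-style `metadata.csv`, a common first step before training with this repo is converting it into the `wav_path|text` filelist format that the VITS data loader reads. Below is a minimal sketch of that conversion; the `wavs/` directory layout and the three-column `id|raw|normalized` row format are assumptions based on LJSpeech conventions, and the function name is hypothetical.

```python
# Sketch: convert LJSpeech-style metadata rows ("id|raw_text|normalized_text")
# into VITS-style filelist rows ("path/to/id.wav|text").
# The wav directory name and column layout are assumptions, not repo-mandated.

def ljspeech_to_vits_filelist(metadata_lines, wav_dir="wavs"):
    """Turn 'id|raw_text|normalized_text' rows into 'wav_path|text' rows."""
    out = []
    for line in metadata_lines:
        parts = line.rstrip("\n").split("|")
        utt_id, text = parts[0], parts[-1]  # use the last (normalized) column
        out.append(f"{wav_dir}/{utt_id}.wav|{text}")
    return out

if __name__ == "__main__":
    rows = ["LJ001-0001|Printing, in the only sense|Printing, in the only sense"]
    print(ljspeech_to_vits_filelist(rows)[0])
    # → wavs/LJ001-0001.wav|Printing, in the only sense
```

For a non-English language you would also need to swap in an appropriate text cleaner (or pre-phonemize the text column yourself) before pointing the training config at the generated filelist.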

nivibilla commented 1 year ago

I'm not sure how to train on different languages, but there are a couple of fine-tuning repos for Chinese and Japanese that would probably help. My fork is English-only.

athenasaurav commented 1 year ago

You can use Piper to train and infer in any number of languages out of the box. We trained a VITS model in Hindi that sounds comparable to our English one. We also trained on a custom in-house English dataset (LJSpeech format) and got well-accented English, but the tonality is not accurate and most voices sound monotonous.