jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

All losses stay almost the same during training when using VITS on multi-lingual datasets #83

Open zhufeijuanjuan opened 2 years ago

zhufeijuanjuan commented 2 years ago

I modified VITS to train multi-lingual voices (English and Chinese) by concatenating a language-specific embedding tensor emb_lang to the text embedding emb_t. Everything stays the same except that the hidden channel dimension of the text encoder input changes from 192 to 196 (language-specific embedding dim = 4). A minimal sketch of what I mean is below.
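
For reference, here is a minimal sketch of that modification, assuming a 4-dim language embedding concatenated to the 192-dim text embedding before the encoder layers (names such as `emb_lang`, `n_langs`, and `lang_emb_dim` are illustrative; the actual code may differ):

```python
import math
import torch
import torch.nn as nn

class TextEncoderInputWithLanguage(nn.Module):
    """Sketch: concatenate a small language embedding (dim 4) to each
    phoneme embedding (dim 192), so the encoder input becomes 196 channels.
    `n_langs` and `lang_emb_dim` are assumed names, not from the VITS repo."""

    def __init__(self, n_vocab, hidden_channels=192, n_langs=2, lang_emb_dim=4):
        super().__init__()
        self.hidden_channels = hidden_channels
        self.emb_t = nn.Embedding(n_vocab, hidden_channels)    # text/phoneme embedding
        self.emb_lang = nn.Embedding(n_langs, lang_emb_dim)    # language-specific embedding
        # downstream encoder layers would now expect 192 + 4 = 196 input channels
        self.in_channels = hidden_channels + lang_emb_dim

    def forward(self, x, lang_id):
        # x: [batch, text_len] token ids; lang_id: [batch] language ids
        emb_t = self.emb_t(x) * math.sqrt(self.hidden_channels)  # [b, t, 192]
        emb_l = self.emb_lang(lang_id)                           # [b, 4]
        emb_l = emb_l.unsqueeze(1).expand(-1, x.size(1), -1)     # broadcast over time -> [b, t, 4]
        h = torch.cat([emb_t, emb_l], dim=-1)                    # [b, t, 196]
        return h.transpose(1, 2)                                 # [b, 196, t] for the conv/attention stack
```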

All losses in TensorBoard except loss/g/fm decrease rapidly during the first 1k steps, then stay almost the same from 1k to 60k steps. loss/g/fm keeps increasing.

Has anyone run into similar issues? Thanks.