I modified VITS to train multilingual voices (English and Chinese) by concatenating a language-specific embedding tensor emb_lang to the text embedding emb_t. Everything else is unchanged, except that the hidden channel size of the text encoder input goes from 192 to 196 (the language embedding dimension is 4).
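For reference, a minimal sketch of this kind of concatenation, assuming PyTorch. The class name `TextLangEmbedding`, the vocabulary size, and the tensor layout are hypothetical and for illustration only; this is not the actual VITS text encoder code.

```python
import torch
import torch.nn as nn


class TextLangEmbedding(nn.Module):
    """Hypothetical helper: appends a language embedding to every text embedding."""

    def __init__(self, n_vocab, n_langs, text_dim=192, lang_dim=4):
        super().__init__()
        self.emb_t = nn.Embedding(n_vocab, text_dim)      # text (symbol) embedding
        self.emb_lang = nn.Embedding(n_langs, lang_dim)   # language embedding
        self.out_dim = text_dim + lang_dim                # 192 + 4 = 196

    def forward(self, tokens, lang_id):
        # tokens: [batch, time] symbol ids, lang_id: [batch] language ids
        t = self.emb_t(tokens)                            # [batch, time, 192]
        l = self.emb_lang(lang_id)                        # [batch, 4]
        # Repeat the per-utterance language vector along the time axis.
        l = l.unsqueeze(1).expand(-1, t.size(1), -1)      # [batch, time, 4]
        return torch.cat([t, l], dim=-1)                  # [batch, time, 196]


# Assumed toy sizes: 100 symbols, 2 languages (0 = English, 1 = Chinese).
emb = TextLangEmbedding(n_vocab=100, n_langs=2)
tokens = torch.randint(0, 100, (3, 7))
lang_id = torch.tensor([0, 1, 0])
x = emb(tokens, lang_id)
print(x.shape)  # torch.Size([3, 7, 196])
```

With this layout, every layer that consumes the embedding has to accept 196 channels instead of 192.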
All losses in TensorBoard except loss/g/fm decrease rapidly during the first 1k steps, then stay almost flat from 1k to 60k steps. loss/g/fm keeps increasing.
Has anyone seen similar behavior? Thanks.