jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

How to train a multilingual model #152

Open liuxiong21 opened 1 year ago

liuxiong21 commented 1 year ago

Who can provide me with an implementation strategy?

heesuju commented 11 months ago

If you have multilingual datasets, you can train it just like a multi-speaker model. Of course there are probably better ways to do this, but I used index 0 for Korean and 1 for English. You also have to process each filelist with the language-specific cleaner, and add the characters used in the filelists to text/symbols.py. To add Korean, I just concatenated the Korean alphabet to the list of symbols like the following:

_kor_letters = 'ㄱㄴㄷㄹㅁㅂㅅㅇㅈㅊㅋㅌㅍㅎㄲㄸㅃㅆㅉㅏㅓㅗㅜㅡㅣㅐㅔ '
_punctuation = ';:,.!?¡¿—…~"«»“” '
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
_letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"

symbols = [_pad] + list(_punctuation) + list(_letters) + list(_letters_ipa) + list(_kor_letters)
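
For the per-language cleaning step, here is a minimal sketch of what "process each filelist with the language-specific cleaner" could look like. It assumes VITS-style filelists (audio_path|speaker_id|text) and a hypothetical korean_cleaners function added to text/cleaners.py; the repo's text._clean_text is used to apply the cleaners:

# Sketch: clean each filelist with its language-specific cleaner before training.
# The korean_cleaners name and the filelist paths are placeholders (assumptions).
from text import _clean_text

filelist_cleaners = {
    "filelists/korean_train_filelist.txt": ["korean_cleaners"],     # language index 0
    "filelists/english_train_filelist.txt": ["english_cleaners2"],  # language index 1
}

for path, cleaners in filelist_cleaners.items():
    cleaned_lines = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("|")
            parts[-1] = _clean_text(parts[-1], cleaners)  # clean the text column
            cleaned_lines.append("|".join(parts))
    with open(path + ".cleaned", "w", encoding="utf-8") as f:
        f.write("\n".join(cleaned_lines) + "\n")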

JohnHerry commented 8 months ago

I have tried a Mandarin-English bilingual VITS. Instead of @heesuju's multi-symbol-set method, I used a single IPA phoneme set, but the result was not good: the Chinese utterances are not clear and there are some pronunciation errors.

I think the language should be given to VITS as a condition. I will try adding a language ID embedding to both the TextEncoder and the PosteriorEncoder; please wait for my results.
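
A minimal sketch of the text-encoder side of that idea (the class name, n_langs, and lang_id are illustrative assumptions, not the repo's code; models.py would need the corresponding constructor and forward changes):

import math
import torch.nn as nn

class TextEncoderWithLang(nn.Module):
    # Sketch: add a language-ID embedding on top of the token embedding,
    # analogous to how a speaker embedding conditions other VITS modules.
    def __init__(self, n_vocab, hidden_channels, n_langs):
        super().__init__()
        self.hidden_channels = hidden_channels
        self.emb = nn.Embedding(n_vocab, hidden_channels)
        self.emb_lang = nn.Embedding(n_langs, hidden_channels)  # new language table

    def forward(self, x, lang_id):
        # x: [batch, text_len] token ids; lang_id: [batch] language indices
        h = self.emb(x) * math.sqrt(self.hidden_channels)
        h = h + self.emb_lang(lang_id).unsqueeze(1)  # broadcast over the time axis
        return h  # the real encoder would continue with the attention blocks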

And if somebody wants to try another option, you can try adding the language condition only to the PosteriorEncoder; a rough sketch of that variant follows. In fact I do not know which will be the more reasonable solution, but if conditioning only the PosteriorEncoder on language works, then no such condition would be needed at inference.
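
A sketch of that posterior-only variant, wrapping the PosteriorEncoder so the language embedding enters only through its conditioning input g (the wrapper class, n_langs, and lang_id are assumptions; it also assumes the model was built with gin_channels > 0 so the conditioning path exists):

import torch.nn as nn

class LangConditionedPosterior(nn.Module):
    # Sketch: add a language embedding to the conditioning tensor g of the
    # existing PosteriorEncoder (enc_q). Since inference skips the posterior
    # encoder, no language label would be needed at synthesis time.
    def __init__(self, enc_q, n_langs, gin_channels):
        super().__init__()
        self.enc_q = enc_q  # the original VITS PosteriorEncoder instance
        self.emb_l = nn.Embedding(n_langs, gin_channels)

    def forward(self, y, y_lengths, lang_id, g=None):
        g_lang = self.emb_l(lang_id).unsqueeze(-1)   # [batch, gin_channels, 1]
        g = g + g_lang if g is not None else g_lang  # merge with speaker conditioning
        return self.enc_q(y, y_lengths, g=g)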