Open liuxiong21 opened 1 year ago
If you have multilingual datasets, you can do it just like training a multi-speaker model. Of course there should be better ways to do this, but I used index 0 for Korean and 1 for English. You also have to process each filelist with the language-specific cleaner, and add the characters used in the filelist to text/symbols.py. To add Korean, I just had to concatenate the Korean alphabet onto the list of symbols, like the following:
```python
_pad = '_'  # as defined in the original VITS text/symbols.py
_kor_letters = 'ㄱㄴㄷㄹㅁㅂㅅㅇㅈㅊㅋㅌㅍㅎㄲㄸㅃㅆㅉㅏㅓㅗㅜㅡㅣㅐㅔ '
_punctuation = ';:,.!?¡¿—…~"«»“” '
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
_letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"

symbols = [_pad] + list(_punctuation) + list(_letters) + list(_letters_ipa) + list(_kor_letters)
```
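The per-filelist cleaner routing described above can be sketched as follows. This is a minimal illustration, not VITS's actual code: the cleaner bodies are placeholders, and the assumed filelist layout (`path|speaker_id|lang_id|text`) is a hypothetical extension of the multi-speaker format.

```python
# Hypothetical sketch: choose a cleaner per language index before
# converting text to symbol IDs. Cleaner bodies are placeholders.

def korean_cleaner(text):
    # placeholder: a real cleaner would normalize/phonemize Korean text
    return text

def english_cleaner(text):
    # placeholder: a real cleaner would also expand abbreviations, etc.
    return text.lower()

CLEANERS = {0: korean_cleaner, 1: english_cleaner}

def clean_filelist_line(line):
    # Assumed format: path|speaker_id|lang_id|text
    path, spk, lang, text = line.strip().split("|")
    return path, int(spk), int(lang), CLEANERS[int(lang)](text)
```

Each filelist line is then cleaned with the cleaner matching its language index, so mixed-language filelists can share one preprocessing pass.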
I have tried a Mandarin-English bilingual VITS. Instead of @heesuju's multi-phoneme-set method, I used a single IPA phoneme set, but the result is not good: the Chinese utterances are unclear and contain some pronunciation errors.
I think there should be a condition that labels the language in VITS. I will try adding a language ID embedding to both the TextEncoder and the PosteriorEncoder; please wait for my results.
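The language ID embedding idea could look roughly like this. This is only a sketch under assumptions: the wrapper class, the channel sizes, and the choice to inject the embedding by summing it onto the encoder input are mine, not VITS's actual architecture.

```python
import torch
import torch.nn as nn

class LangConditionedEncoder(nn.Module):
    """Hypothetical wrapper: condition any encoder on a language ID
    by adding a learned language vector to its input sequence."""

    def __init__(self, encoder, n_langs, channels):
        super().__init__()
        self.encoder = encoder
        self.lang_emb = nn.Embedding(n_langs, channels)

    def forward(self, x, lang_id):
        # x: [batch, channels, time]; broadcast the language vector over time
        l = self.lang_emb(lang_id).unsqueeze(-1)  # [batch, channels, 1]
        return self.encoder(x + l)
```

The same wrapper could in principle be applied to both the TextEncoder and the PosteriorEncoder, with the language ID taken from the filelist.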
And if somebody wants to try another approach, you could try adding the language condition only to the PosteriorEncoder. In fact I do not know which will be the more reasonable solution, but if conditioning only the PosteriorEncoder works, then no such condition would be needed at inference.
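The reasoning behind the PosteriorEncoder-only variant: in VITS the posterior encoder consumes the ground-truth spectrogram, which only exists during training, so a language embedding folded into its conditioning would never have to be supplied at inference. A minimal sketch, with illustrative names (the `gin_channels`-style global-conditioning shape mirrors VITS's speaker conditioning, but this module is an assumption, not VITS code):

```python
import torch
import torch.nn as nn

class TrainTimeLangCondition(nn.Module):
    """Hypothetical: fold a language embedding into the global
    conditioning tensor g that the PosteriorEncoder receives.
    Used only in the training path, never at inference."""

    def __init__(self, n_langs, gin_channels):
        super().__init__()
        self.lang_emb = nn.Embedding(n_langs, gin_channels)

    def forward(self, g_speaker, lang_id):
        # g_speaker: [batch, gin_channels, 1] speaker conditioning
        return g_speaker + self.lang_emb(lang_id).unsqueeze(-1)
```

Whether the flow and decoder then generalize across languages without an explicit language label at inference is exactly the open question raised above.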
Who can provide me with an implementation strategy?