lucidrains / naturalspeech2-pytorch

Implementation of Natural Speech 2, Zero-shot Speech and Singing Synthesizer, in Pytorch
MIT License

added multi-lingual phonemizer #10

Closed · manmay-nakhashi closed this pull request 1 year ago

manmay-nakhashi commented 1 year ago

Integrated the espeak phonemizer; integrated the tokenizer text_to_ids.

lucidrains commented 1 year ago

@manmay-nakhashi hey Manmay, thanks for the PR

i'm just a bit concerned, because i see that for all the files, you have """ from https://github.com/coqui-ai/TTS/""" at the top

did you take the code directly, without modification, from that repo?

manmay-nakhashi commented 1 year ago

@lucidrains no, I have modified some of the files. Some files, like punctuations.py and the files under the english folders, are the same because it's generic code. I have modified the tokenizer to work with our needs; all other files are standard.

manmay-nakhashi commented 1 year ago

@lucidrains rewriting all the functions, I think I can make every function more generic language-wise.

manmay-nakhashi commented 1 year ago

To test the tokenizer:

    python3 utils/tokenizer.py
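For readers unfamiliar with the term, text_to_ids conceptually maps a phoneme string onto integer ids over a fixed symbol table. The following is a minimal, self-contained sketch of that idea; the names and the tiny symbol set are placeholders, not this PR's actual implementation:

    # Illustrative only: the kind of lookup a text_to_ids-style function performs.
    # The symbol table here is a placeholder, not the tokenizer's real vocabulary.
    symbols = ['<pad>'] + sorted(set("həloʊ wɜːld"))
    symbol_to_id = {s: i for i, s in enumerate(symbols)}

    def text_to_ids(phoneme_string):
        # unknown symbols are dropped here; a real tokenizer might map them
        # to a dedicated <unk> id instead
        return [symbol_to_id[c] for c in phoneme_string if c in symbol_to_id]

    print(text_to_ids("həloʊ"))  # a list of small integers, one per phoneme symbol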

manmay-nakhashi commented 1 year ago

@lucidrains can you review this one ?

lucidrains commented 1 year ago

@manmay-nakhashi hi Manmay, thanks for continuing to polish this pull request

is there not a package out there that takes care of phoneme tokenization? (i'm assuming that this is what this pull request is for)

it would seem strange to me that the TTS field does not already have such a package in place, as it should be a more mature field relative to other subfields in ML

but if another TTS researcher / ML engineer were to attest that this PR is necessary, I can immediately merge

manmay-nakhashi commented 1 year ago

@lucidrains there is a library called phonemizer, but it only does phonemization; it doesn't include word cleaning, number expansion, text cleaning, etc. Also, espeak supports many languages, and having many languages share the same phoneme representation makes multi-lingual training much easier.
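For reference, the bare phonemization step looks roughly like the sketch below (a minimal example using the phonemizer package's public phonemize function, not code from this PR); any word cleaning, number expansion, or other text normalization would have to sit in front of it:

    # Minimal sketch: bare phonemization via the phonemizer package's espeak
    # backend. Text cleaning / number expansion is not handled at this level.
    from phonemizer import phonemize

    text = "hello world"
    phones = phonemize(
        text,
        language = 'en-us',   # espeak language code
        backend = 'espeak',
        strip = True,
    )
    print(phones)  # a string of IPA phoneme symbols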

lucidrains commented 1 year ago

@manmay-nakhashi ahh ok, sounds good. i'll let the pull request stay open a bit longer for feedback while preparing the framework to accept the phonemized ids. thank you for the explanation!

nivibilla commented 1 year ago

Correct me if I'm wrong, but doesn't phonemizer also support multiple languages, with espeak as its backend? Wouldn't it be cleaner to use that, but still do the text cleaning and preprocessing first?

manmay-nakhashi commented 1 year ago

@nivibilla if we use a different phonemizer from the phonemizer library, the base char set (phoneme symbols) will be different, so it's safer to use a single phonemization library that supports multiple languages under the same phoneme char set :thinking:
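As a concrete illustration of that point, here is a small sketch (again calling the phonemizer package directly, not this PR's code): phonemizing two different languages through the same espeak backend yields strings drawn from one shared IPA symbol inventory, so a single embedding table can cover both.

    # Sketch: one backend, one phoneme char set across languages. The exact
    # output strings depend on the installed espeak-ng version.
    from phonemizer import phonemize

    en = phonemize("water",  language = 'en-us', backend = 'espeak', strip = True)
    de = phonemize("wasser", language = 'de',    backend = 'espeak', strip = True)

    # both outputs are built from espeak's single IPA inventory, so the union
    # of their characters stays inside one fixed symbol set
    print(set(en) | set(de))

Mixing outputs from different phonemizers or backends would instead mix symbol conventions, which is exactly the char-set mismatch described above.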

nivibilla commented 1 year ago

Ah yes, that makes sense. I haven't looked into the phonemizer library in depth for multiple languages, so it might be that they use different character sets for different languages.

lucidrains commented 1 year ago

@manmay-nakhashi are any of the packages being depended on in this pull request not MIT licensed?

manmay-nakhashi commented 1 year ago

@lucidrains all are either MIT or Apache 2.0

lucidrains commented 1 year ago

@manmay-nakhashi ok, i'll let this sit over the weekend and plan on merging it next Monday! thank you!

lucidrains commented 1 year ago

@manmay-nakhashi how many unique tokens does the phoneme tokenizer produce?

manmay-nakhashi commented 1 year ago

@lucidrains 122

    _vowels = "iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻ"
    _non_pulmonic_consonants = "ʘɓǀɗǃʄǂɠǁʛ"
    _pulmonic_consonants = "pbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟ"
    _suprasegmentals = "'̃ˈˌːˑ. ,-"
    _other_symbols = "ʍwɥʜʢʡɕʑɺɧʲ"
    _diacrilics = "ɚ˞ɫ"
    _phonemes = _vowels + _non_pulmonic_consonants + _pulmonic_consonants + _suprasegmentals + _other_symbols + _diacrilics
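As a rough sanity check on that number (a sketch, not code from the PR), one can count the unique symbols in the set quoted above; the reported 122-token vocabulary presumably also covers punctuation and any special tokens handled elsewhere in the tokenizer:

    # count the distinct IPA symbols defined in the snippet above
    unique_symbols = set(_phonemes)
    print(len(unique_symbols))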

lucidrains commented 1 year ago

@manmay-nakhashi thank you! will get this merged this evening