Closed. manmay-nakhashi closed this pull request 1 year ago.
@manmay-nakhashi hey Manmay, thanks for the PR
i'm just a bit concerned, because i see that for all the files, you have """ from https://github.com/coqui-ai/TTS/ """ at the top
did you take the code directly, without modification, from that repo?
@lucidrains no, i have modified some of the files. some files, like punctuations.py and the files under the english folder, are the same, because it's generic code. i have modified the tokenizer to work with our needs; all the other files are standard.
@lucidrains by rewriting all the functions, i think i can make every function more generic language-wise.
to test the tokenizer
python3 utils/tokenizer.py
@lucidrains can you review this one ?
@manmay-nakhashi hi Manmay, thanks for continuing to polish this pull request
is there not a package out there that takes care of phoneme tokenization? (i'm assuming that this is what this pull request is for)
it would seem strange to me that the TTS field does not already have such a package in place, as it should be a more mature field relative to other subfields in ML
but if another TTS research / ML engineer were to attest that this PR is necessary, I can immediately merge
@lucidrains there is a library called phonemizer, but it only does phonemization; it doesn't include word cleaning, number expansion, text cleaning, etc. Also, espeak supports many languages, and having many languages share the same phoneme representation makes multilingual training much easier.
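To illustrate the kind of preprocessing being described (this is a hypothetical minimal sketch, not the PR's actual code — real cleaners in TTS codebases also handle ordinals, currency, abbreviations, and more):

```python
import re

# Word forms for single digits, used by the toy number expander below.
_ONES = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]

def expand_digits(text):
    # Expand each bare digit to its word form ("track 3" -> "track three").
    return re.sub(r"\d", lambda m: " " + _ONES[int(m.group())] + " ", text)

def clean_text(text):
    # Lowercase, expand digits, then collapse any extra whitespace.
    text = text.lower()
    text = expand_digits(text)
    return re.sub(r"\s+", " ", text).strip()
```

Only after cleaning like this would the text be handed to the phonemizer backend.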
@manmay-nakhashi ahh ok sounds good, i'll let the pull request stay open for a bit more for feedback while preparing the framework to accept the phonemized ids. thank you for the explanation!
Correct me if I'm wrong, but doesn't phonemizer also support multiple languages? And it has espeak as its backend. Wouldn't it be cleaner to use that, but still do the text cleaning and preprocessing first?
@nivibilla if we use different phonemizers from the phonemizer library, your base char set (phoneme symbols) will be different, so it's safer to use a single phonemization library that supports multiple languages under the same phoneme char set :thinking:
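The point about a shared char set can be sketched like this (a hypothetical toy example — the names `symbol_to_id` and `phonemes_to_ids` are illustrative, not from the PR): with one fixed symbol table, every language's phoneme string maps to the same ids, so a multilingual model sees a consistent vocabulary.

```python
# A single, fixed symbol table shared by all languages. If each language used
# its own phonemizer backend, the same sound could get a different id.
_pad = "_"
_example_phonemes = "aehl\u026a\u0259\u028a"  # a e h l ɪ ə ʊ (tiny subset for illustration)

symbols = [_pad] + sorted(_example_phonemes)
symbol_to_id = {s: i for i, s in enumerate(symbols)}

def phonemes_to_ids(phoneme_str):
    # One id per phoneme code point; unknown symbols raise KeyError.
    return [symbol_to_id[p] for p in phoneme_str]
```

Because the table is fixed up front, `phonemes_to_ids` returns identical ids for the same phoneme regardless of which language produced it.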
Ah yes, that makes sense. I haven't looked into the phonemizer library in depth for multiple languages, so it might be that they use different character sets for different languages.
@manmay-nakhashi are any of the packages being depended on in this pull request not MIT licensed?
@lucidrains all are either MIT or Apache 2.0
@manmay-nakhashi ok, i'll let this sit over the weekend and plan on merging it next Monday! thank you!
@manmay-nakhashi how many unique tokens does the phoneme tokenizer produce?
@lucidrains 122
_vowels = "iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻ"
_non_pulmonic_consonants = "ʘɓǀɗǃʄǂɠǁʛ"
_pulmonic_consonants = "pbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟ"
_suprasegmentals = "'̃ˈˌːˑ. ,-"
_other_symbols = "ʍwɥʜʢʡɕʑɺɧʲ"
_diacrilics = "ɚ˞ɫ"
_phonemes = _vowels + _non_pulmonic_consonants + _pulmonic_consonants + _suprasegmentals + _other_symbols + _diacrilics
@manmay-nakhashi thank you! will get this merged this evening
integrated espeak
integrated tokenizer
text_to_ids