kylebgorman / pynini

Read-only mirror of Pynini
http://pynini.opengrm.org
Apache License 2.0

baumwelchtrain optimization question #52

Closed: mmcauliffe closed this issue 2 years ago

mmcauliffe commented 2 years ago

I'm trying to do some G2P training using dictionaries I've cleaned up from WikiPron (https://github.com/MontrealCorpusTools/mfa-models/tree/main/dictionary). I'm currently stuck trying to train one for the Mandarin character dictionary (https://github.com/MontrealCorpusTools/mfa-models/blob/main/dictionary/mandarin_ipa.dict), where training just hangs on the first iteration. Cutting the data down to a subset of ~3k words still takes 20 minutes, no matter how I adjust the batch size, learning rate, and memory flags. Any other ideas for how to optimize it?

kylebgorman commented 2 years ago

One possible optimization is to make epsilon-to-epsilon mappings impossible. Instead, you build a covering grammar in which an input sequence of length k (> 1) can map to an output sequence of length j (> 1), where j need not equal k.
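A minimal sketch of what that topology might look like in Pynini, with toy alphabets standing in for the real grapheme and phone inventories (the names and chunk lengths here are illustrative, not taken from the actual training setup):

```python
import pynini

# Toy alphabets standing in for the real grapheme and phone inventories.
graphemes = pynini.union("a", "b", "c")
phones = pynini.union("p", "q")

# Allow input chunks of length 1..k and output chunks of length 1..j.
k, j = 2, 2
ins = pynini.closure(graphemes, 1, k)
outs = pynini.closure(phones, 1, j)

# One alignment "move" pairs a nonempty input chunk with a nonempty
# output chunk; since neither side can be empty, no epsilon-to-epsilon
# arc can arise anywhere in the machine.
pair = pynini.cross(ins, outs)

# The covering grammar is one or more such moves, over which EM
# training can then learn alignment weights.
covering = pynini.closure(pair, 1).optimize()
```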

I don't understand how this would ever work for Mandarin, though. There's no real phonemic principle, and the input vocabulary is enormous. It seems like a writing system that simply cannot be G2P'ed effectively.

mmcauliffe commented 2 years ago

Yeah, I think it'd end up just being a memorized table of pronunciations per character, given how kanji are encoded. Theoretically you could break them down, and there is some phonetic information to be gleaned in there, patterns like "if the kanji has the 生 radical in it, it's likely to be pronounced as shen in Mandarin or sei in Japanese". Unfortunately, I haven't found any decomposition like what's supported for Korean, so it's probably not possible.

So I think what I'll probably do is just extract all the single-character pronunciations and train a "G2P" model over those, so that it's at least consistent with other languages. That should be small enough not to take forever, but it's also going to miss any coarticulation effects in compounds.
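The extraction step could be a simple filter; here's a minimal sketch, assuming the MFA dictionary format of one tab-separated word/pronunciation pair per line (the filenames are placeholders):

```python
# Keep only the single-character entries from the dictionary.
# Assumes "<word>\t<pronunciation>" per line, as in the MFA .dict files;
# the filenames are placeholders.
with open("mandarin_ipa.dict", encoding="utf-8") as source, \
     open("mandarin_char_ipa.dict", "w", encoding="utf-8") as sink:
    for line in source:
        word, pron = line.rstrip("\n").split("\t", 1)
        if len(word) == 1:  # a single character
            print(f"{word}\t{pron}", file=sink)
```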

kylebgorman commented 2 years ago

People use the Cangjie decomposition for this sort of thing. I don't know enough to know whether it makes sense here:

https://en.wikipedia.org/wiki/Cangjie_input_method
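If that pans out, one way to plug it in would be to rewrite each character as its Cangjie code before training, so the G2P model sees a small grapheme alphabet instead of thousands of distinct characters. A hedged sketch, assuming a tab-separated lookup table (the table file and its format are hypothetical):

```python
# Hypothetical "<char>\t<code>" lookup table; not a real bundled resource.
cangjie = {}
with open("cangjie_table.txt", encoding="utf-8") as source:
    for line in source:
        char, code = line.rstrip("\n").split("\t", 1)
        cangjie[char] = code

def decompose(word: str) -> str:
    """Replaces each character with its Cangjie code letters."""
    # Fall back to the raw character when no decomposition is listed.
    return " ".join(cangjie.get(char, char) for char in word)
```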

mmcauliffe commented 2 years ago

Ah, perfect, thanks for the reference. I'll do some research and maybe try out whether decomposition can lead to decent pronunciation models, but I'll close this issue.