Thanks for this Eric. This has been bugging me for a while. I suppose we could have some default tokenization based on spaces, for example, but I would like to implement this with the ability to add a regex pattern to any given mapping (probably in the config.yml) to override the default tokenization. I'm struggling to think of a concrete example off-hand, but I'm almost certain there will be some cases where a default tokenization will be inappropriate for some orthographies.
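Something like this per-mapping key in config.yml is what I have in mind (purely a sketch, roughly following the existing config structure; `tokenizer_pattern` is a made-up key, not an existing g2p option):

```yaml
mappings:
  - language_name: Example
    in_lang: xyz
    out_lang: xyz-ipa
    mapping: xyz_to_ipa.csv
    # Hypothetical key, not an existing g2p option: if present, tokenize on
    # this regex instead of the default whitespace-based tokenization.
    tokenizer_pattern: "[^\\s\\-']+"
```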
You're welcome. It's been bugging me ever since I did the fra g2p!
I suggest we move classes `Tokenizer` and `DefaultTokenizer` almost as is into g2p initially, so that we can use the function `tokenize_text()`, leaving all the XML-related stuff in ReadAlongs.
If a given language has special tokenization rules, it should then only require a patch within g2p to handle it consistently for all g2p client code.
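Roughly what I have in mind, as a sketch (the class and method names follow readalongs, but the internals here are illustrative, not the actual readalongs implementation):

```python
import re
from typing import Dict, List


class DefaultTokenizer:
    """Fallback tokenizer: split on runs of Unicode word characters."""

    def __init__(self, word_pattern: str = r"[\w']+"):
        self.word_re = re.compile(word_pattern)

    def tokenize_text(self, text: str) -> List[Dict]:
        """Return a list of {"text": ..., "is_word": ...} tokens covering all of text."""
        tokens = []
        pos = 0
        for match in self.word_re.finditer(text):
            if match.start() > pos:
                tokens.append({"text": text[pos:match.start()], "is_word": False})
            tokens.append({"text": match.group(), "is_word": True})
            pos = match.end()
        if pos < len(text):
            tokens.append({"text": text[pos:], "is_word": False})
        return tokens


class Tokenizer(DefaultTokenizer):
    """Tokenizer whose notion of 'word character' comes from a mapping's input inventory."""

    def __init__(self, inventory: List[str]):
        # Any character the mapping can consume counts as a word character.
        chars = {c for unit in inventory for c in unit if not c.isspace()}
        if chars:
            pattern = "[" + re.escape("".join(sorted(chars))) + "]+"
        else:
            pattern = r"[\w']+"  # empty inventory: fall back to the default behaviour
        super().__init__(word_pattern=pattern)
```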
I think I'll start working on this fairly soon, because it's interfering with other work I'm doing. I'll see how the work goes and you can review it once I'm ready to submit the PR.
Something else semi-related: `und` only maps letters, not combining characters. I'm playing with a variant where I added a rule for each of the Unicode combining characters (U+0300–U+036F), mapping each to `""`. This lets `und` handle accented characters by mapping them to their unaccented variant, which is in keeping with the spirit of that mapping. Not related to this issue, but it came up in the same project that made me think about tokenization.
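For the record, those extra rules are trivially generated, something along these lines (the `"in"`/`"out"` keys are the standard g2p rule format; the rest is just a sketch, and it assumes the input gets NFD-decomposed so the combining marks are separate characters):

```python
import json

# One deletion rule per Unicode combining diacritical mark (U+0300-U+036F),
# to append to the und mapping so accented letters fall back to their base
# letter once decomposed.
rules = [{"in": chr(cp), "out": ""} for cp in range(0x0300, 0x0370)]
print(json.dumps(rules[:3], ensure_ascii=False, indent=2))
```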
Meh, here's a gotcha: when a lang has `lang` -> `lang-equiv` -> `lang-ipa`, I would like to tokenize `lang` according to the union of the inventories of `lang->lang-equiv` and `lang-equiv->lang-ipa`. I'll have to think about how to solve that. ReadAlongs currently does not correctly handle this case, as far as I can tell, but I suspect we just haven't stumbled on an error caused by this yet.
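One possible way to get that union (sketch only; the `Mapping(...)` constructor and `inventory()` call are from memory and may not match the current API exactly, and `lang`/`lang-equiv`/`lang-ipa` are placeholders, not real codes):

```python
from g2p.mappings import Mapping


def chained_input_inventory(*steps):
    """Union of the input inventories of every mapping step in a chain."""
    union = set()
    for in_lang, out_lang in steps:
        mapping = Mapping(in_lang=in_lang, out_lang=out_lang)
        union.update(mapping.inventory("in"))
    return union


# Tokenize "lang" text using everything any step of the chain can consume.
inventory = chained_input_inventory(("lang", "lang-equiv"), ("lang-equiv", "lang-ipa"))
```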
Solved in PR #82:

- `g2p convert` now supports a `--tok` option to trigger tokenization.
- `make_g2p` now supports two optional arguments, `tok_lang` and `tok_path`, to trigger tokenization.
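For example (a sketch; the exact Python call and the CLI argument order are from memory, so double-check against the PR):

```python
from g2p import make_g2p

# Roughly equivalent to: g2p convert --tok "ceci est un test." fra fra-ipa
# With tok_lang set, the text is tokenized first and each word is converted
# separately, so ^ and $ in the mapping's rules anchor at word boundaries.
transducer = make_g2p("fra", "fra-ipa", tok_lang="fra")
print(transducer("ceci est un test.").output_string)
```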
When ReadAlongs/Studio calls g2p, it does so on tokenized text, so that each word is passed as a single string, and `^` can match the beginning of the word, and `$` the end of the word. When `g2p convert` or `convertextract` are used, the input text (or maybe line) is passed as a whole, so that `^` and `$` match the beginning and end of the line, respectively, instead of the beginning and end of each word.

**Affected mappings:**
Mappings `mic/mic_to_ipa.json` and `fra/fra_to_ipa` encode rules that are sensitive to the beginning or end of words, and only work correctly on single words. The same is true of mappings `git/Orthography.csv` and `git/Orthography_Deterministic.csv`, but those are not in use so they're not an issue.

**Showing the problem:**
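Something like the following pair of conversions (a stand-in only: "seq" is a placeholder, not the actual word, and the real outputs depend on the current mic mapping):

```python
from g2p import make_g2p

# Placeholder illustration only: "seq" stands in for the actual word reported.
transducer = make_g2p("mic", "mic-ipa")
print(transducer("seq").output_string)          # first command: one word on its own
print(transducer("seq seq seq").output_string)  # second command: three words on one line
```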
In the second command, the first word matches `s` in word-initial position, and the third one matches `q` in word-final position, while the middle one matches neither. The correct output should have been `səx səx səx`.
A similar problem exists in French, where I tried to match spaces around the words as also marking the beginning and end of words, but with logic that fails to apply when there is punctuation present:
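Illustratively (not the original commands; the outputs quoted below came from the real run, not from this sketch):

```python
from g2p import make_g2p

transducer = make_g2p("fra", "fra-ipa")
# Word on its own: the space-based word-boundary logic applies.
print(transducer("test").output_string)
# Same word followed by punctuation: the boundary logic fails to fire.
print(transducer("un test.").output_string)
```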
Although neither `tɛ` nor `tʌst` is great (`tɛst` would have been better), we would like the two to be mapped identically.

**Possible solution:**
In `readalongs/text/tokenize_xml` we have logic that tokenizes text along this rule:
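Paraphrasing it from memory (not a verbatim copy of the readalongs code), the rule is roughly: a character belongs to a word if the mapping can consume it, or if it is a Unicode letter, number, or combining mark; everything else is a separator. In sketch form:

```python
import unicodedata


def is_word_character(char: str, inventory: set) -> bool:
    # Rough paraphrase of the readalongs rule: characters in the mapping's
    # input inventory are word characters, and so are Unicode letters,
    # numbers, and combining marks; anything else separates words.
    if char in inventory:
        return True
    return unicodedata.category(char)[0] in ("L", "N", "M")
```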
While this logic is necessary in readalongs, I think it could reasonably belong inside g2p, since it is tightly related to the g2p mappings.
Then, `g2p convert`, `g2p scan`, `convertextract`, etc., could all use the algorithm that `readalongs align` already effectively uses, sketched below. The benefit would be that applying a g2p mapping in any context would always produce the same output.
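In sketch form (the tokenizer stands in for whatever inventory-based tokenizer ends up in g2p; the names here are illustrative):

```python
def convert_text(text, tokenizer, transducer):
    # Tokenize first, then run g2p only on the word tokens, passing the
    # separators (spaces, punctuation) through untouched. ^ and $ in the
    # mapping's rules then always see a single word, no matter which tool
    # (g2p convert, g2p scan, convertextract, readalongs align) is calling.
    out = []
    for token in tokenizer.tokenize_text(text):
        if token["is_word"]:
            out.append(transducer(token["text"]).output_string)
        else:
            out.append(token["text"])
    return "".join(out)
```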