Thanks for this Eric. This has been bugging me for a while. I suppose we could have some default tokenization based on spaces, for example, but I would like to implement this with the ability to add a regex pattern to any given mapping (probably in the config.yml) to override the default tokenization. I'm struggling to think of a concrete example off-hand, but I'm almost certain there will be some cases where a default tokenization will be inappropriate for some orthographies.
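Something like this per-mapping key in config.yml is what I have in mind (purely a sketch, roughly following the existing config structure; `tokenizer_pattern` is a made-up key, not an existing g2p option):

```yaml
mappings:
  - language_name: Example
    in_lang: xyz
    out_lang: xyz-ipa
    mapping: xyz_to_ipa.csv
    # Hypothetical key, not an existing g2p option: if present, tokenize on
    # this regex instead of the default whitespace-based tokenization.
    tokenizer_pattern: "[^\\s\\-']+"
```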
You're welcome. It's been bugging me ever since I did the fra g2p!
I suggest we move classes `Tokenizer` and `DefaultTokenizer` almost as is into g2p initially, so that we can use the function `tokenize_text()`, leaving all the XML-related stuff in ReadAlongs.
If a given language has special tokenization rules, it should then only require a patch within g2p to handle it consistently for all g2p client code.
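Roughly what I have in mind, as a sketch (the class and method names follow readalongs, but the internals here are illustrative, not the actual readalongs implementation):

```python
import re
from typing import Dict, List


class DefaultTokenizer:
    """Fallback tokenizer: split on runs of Unicode word characters."""

    def __init__(self, word_pattern: str = r"[\w']+"):
        self.word_re = re.compile(word_pattern)

    def tokenize_text(self, text: str) -> List[Dict]:
        """Return a list of {"text": ..., "is_word": ...} tokens covering all of text."""
        tokens = []
        pos = 0
        for match in self.word_re.finditer(text):
            if match.start() > pos:
                tokens.append({"text": text[pos:match.start()], "is_word": False})
            tokens.append({"text": match.group(), "is_word": True})
            pos = match.end()
        if pos < len(text):
            tokens.append({"text": text[pos:], "is_word": False})
        return tokens


class Tokenizer(DefaultTokenizer):
    """Tokenizer whose notion of 'word character' comes from a mapping's input inventory."""

    def __init__(self, inventory: List[str]):
        # Any character the mapping can consume counts as a word character.
        chars = {c for unit in inventory for c in unit if not c.isspace()}
        if chars:
            pattern = "[" + re.escape("".join(sorted(chars))) + "]+"
        else:
            pattern = r"[\w']+"  # empty inventory: fall back to the default behaviour
        super().__init__(word_pattern=pattern)
```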
I think I'll start working on this fairly soon, because it's interfering with other work I'm doing. I'll see how the work goes and you can review it once I'm ready to submit the PR.
Something else semi-related: `und` only maps letters, not combining characters. I'm playing with a variant where I added a rule for each of the Unicode combining characters (U+0300–U+036F), mapping each to `""`. This lets `und` handle accented characters by mapping them to their unaccented variant, which is in keeping with the spirit of that mapping. Not related to this issue, but it came up in the same project that made me think about tokenization.
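For the record, those extra rules are trivially generated, something along these lines (the `"in"`/`"out"` keys are the standard g2p rule format; the rest is just a sketch, and it assumes the input gets NFD-decomposed so the combining marks are separate characters):

```python
import json

# One deletion rule per Unicode combining diacritical mark (U+0300-U+036F),
# to append to the und mapping so accented letters fall back to their base
# letter once decomposed.
rules = [{"in": chr(cp), "out": ""} for cp in range(0x0300, 0x0370)]
print(json.dumps(rules[:3], ensure_ascii=False, indent=2))
```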
Meh, here's a gotcha: when a lang has `lang` -> `lang-equiv` -> `lang-ipa`, I would like to tokenize `lang` according to the union of the inventories of `lang->lang-equiv` and `lang-equiv->lang-ipa`. I'll have to think about how to solve that. ReadAlongs currently does not correctly handle this case, as far as I can tell, but I suspect we just haven't stumbled on an error caused by this yet.
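One possible way to get that union (sketch only; the `Mapping(...)` constructor and `inventory()` call are from memory and may not match the current API exactly, and `lang`/`lang-equiv`/`lang-ipa` are placeholders, not real codes):

```python
from g2p.mappings import Mapping


def chained_input_inventory(*steps):
    """Union of the input inventories of every mapping step in a chain."""
    union = set()
    for in_lang, out_lang in steps:
        mapping = Mapping(in_lang=in_lang, out_lang=out_lang)
        union.update(mapping.inventory("in"))
    return union


# Tokenize "lang" text using everything any step of the chain can consume.
inventory = chained_input_inventory(("lang", "lang-equiv"), ("lang-equiv", "lang-ipa"))
```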
Solved in PR #82:

- `g2p convert` now supports a `--tok` option to trigger tokenization.
- `make_g2p` now supports two optional arguments, `tok_lang` and `tok_path`, to trigger tokenization.
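For example (a sketch; the exact Python call and the CLI argument order are from memory, so double-check against the PR):

```python
from g2p import make_g2p

# Roughly equivalent to: g2p convert --tok "ceci est un test." fra fra-ipa
# With tok_lang set, the text is tokenized first and each word is converted
# separately, so ^ and $ in the mapping's rules anchor at word boundaries.
transducer = make_g2p("fra", "fra-ipa", tok_lang="fra")
print(transducer("ceci est un test.").output_string)
```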
When ReadAlongs/Studio calls g2p, it does so on tokenized text, so that each word is passed as a single string, and `^` can match the beginning of the word, and `$` the end of the word. When `g2p convert` or `convertextract` are used, the input text (or maybe line) is passed as a whole, so that `^` and `$` match the beginning and end of the line, respectively, instead of the beginning and end of each word.

**Affected mappings:**
Mappings `mic/mic_to_ipa.json` and `fra/fra_to_ipa` encode rules that are sensitive to the beginning or end of words, and only work correctly on single words. The same is true of mappings `git/Orthography.csv` and `git/Orthography_Deterministic.csv`, but those are not in use so they're not an issue.

**Showing the problem:**
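Something like the following pair of conversions (a stand-in only: "seq" is a placeholder, not the actual word, and the real outputs depend on the current mic mapping):

```python
from g2p import make_g2p

# Placeholder illustration only: "seq" stands in for the actual word reported.
transducer = make_g2p("mic", "mic-ipa")
print(transducer("seq").output_string)          # first command: one word on its own
print(transducer("seq seq seq").output_string)  # second command: three words on one line
```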
In the second command, the first word matches `s` in word-initial position, and the third one matches `q` in word-final position, while the middle one matches neither. The correct output should have been `səx səx səx`.
A similar problem exists in French, where I tried to match spaces around the words as also marking the beginning and end of words, but with logic that fails to apply when there is punctuation present:
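Illustratively (not the original commands; the outputs quoted below came from the real run, not from this sketch):

```python
from g2p import make_g2p

transducer = make_g2p("fra", "fra-ipa")
# Word on its own: the space-based word-boundary logic applies.
print(transducer("test").output_string)
# Same word followed by punctuation: the boundary logic fails to fire.
print(transducer("un test.").output_string)
```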
Although neither `tɛ` nor `tʌst` is great (`tɛst` would have been better), we would like the two to be mapped identically.

**Possible solution:**
In `readalongs/text/tokenize_xml` we have logic that tokenizes text along this rule:
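Paraphrasing it from memory (not a verbatim copy of the readalongs code), the rule is roughly: a character belongs to a word if the mapping can consume it, or if it is a Unicode letter, number, or combining mark; everything else is a separator. In sketch form:

```python
import unicodedata


def is_word_character(char: str, inventory: set) -> bool:
    # Rough paraphrase of the readalongs rule: characters in the mapping's
    # input inventory are word characters, and so are Unicode letters,
    # numbers, and combining marks; anything else separates words.
    if char in inventory:
        return True
    return unicodedata.category(char)[0] in ("L", "N", "M")
```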
While this logic is necessary in readalongs, I think it could reasonably belong inside g2p, since it is tightly related to the g2p mappings.
Then, `g2p convert`, `g2p scan`, `convertextract`, etc., could all use the algorithm that `readalongs align` already effectively uses, sketched below. The benefit would be that applying a g2p mapping in any context would always produce the same output.
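In sketch form (the tokenizer stands in for whatever inventory-based tokenizer ends up in g2p; the names here are illustrative):

```python
def convert_text(text, tokenizer, transducer):
    # Tokenize first, then run g2p only on the word tokens, passing the
    # separators (spaces, punctuation) through untouched. ^ and $ in the
    # mapping's rules then always see a single word, no matter which tool
    # (g2p convert, g2p scan, convertextract, readalongs align) is calling.
    out = []
    for token in tokenizer.tokenize_text(text):
        if token["is_word"]:
            out.append(transducer(token["text"]).output_string)
        else:
            out.append(token["text"])
    return "".join(out)
```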