Open oadams opened 6 years ago
Is this the type of thing LabelSegmenter is supposed to deal with?
Yeah, that's right.
I have never really dealt with the correspondence between orthography and pronunciation (graphophonemics), because I mostly work on languages without a written tradition. It seems clear that a full-fledged language model is necessary to disambiguate: English 'lead' could be the metal or the verb 'to lead', with different pronunciations, etc. Tools such as NooJ do that very well, I'm told (in rule-based mode).
An orthography for Na is currently under development (well advanced: there is a working 'beta' version). Even though it is designed for the modern language (i.e. without the complexities that accumulate as a language evolves after its orthography is fixed, as in English), conversion to phonemic form would be highly problematic: the orthography conveys very little information about tone, and it includes dialect compromises to facilitate use by speakers from various areas. My intuition is that using orthographic input to Persephone would detract considerably from the quality of the model and its output, as compared with phonemic input.
@nikopartanen noted that:
Cyrillic writing system in itself adds lots of redundancy and the grapheme-phoneme ratio is not ideal, so I assume that for Persephone more exact phoneme level could be more suitable
The tests that he carried out with Persephone (data: Komi language, Uralic) would seem to support this intuition of his (=that orthographic input won't do).
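To make the distinction concrete, here is a minimal sketch of what a label segmenter can and cannot do with orthographic input. It is not Persephone's actual code: the function `segment_graphemes` and the toy `INVENTORY` are hypothetical names made up for illustration. The sketch splits a string into units by greedy longest match, so digraphs are kept together, but it has no way of choosing between pronunciations.

```python
from typing import List, Set


def segment_graphemes(text: str, inventory: Set[str]) -> List[str]:
    """Split `text` into units from `inventory` by greedy longest match."""
    max_len = max(len(unit) for unit in inventory)
    units: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so digraphs win over single letters.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in inventory:
                units.append(candidate)
                i += size
                break
        else:
            # No unit of any length matched at this position.
            raise ValueError(f"no unit matches at position {i}: {text[i:]!r}")
    return units


# Hypothetical toy inventory (an assumption, not a real orthography).
INVENTORY = {"a", "d", "e", "h", "l", "o", "s", "sh", "t"}

print(segment_graphemes("shot", INVENTORY))  # the digraph "sh" stays one unit
print(segment_graphemes("lead", INVENTORY))  # same units for metal and verb
```

The second call returns the same four units whichever sense of 'lead' is meant, which is the point made above: segmentation alone cannot recover the phonemic form from the orthography.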
Needs to be useable by a linguist. Talk to Alexis, Steven, Alex, Hywel and Rolando about this.