persephone-tools / persephone

A tool for automatic phoneme transcription
Apache License 2.0

Determine how to programmatically express phonemic segmentation for orthographies where greedy left-to-right segmentation is deficient because of ambiguities #127

Open oadams opened 6 years ago

oadams commented 6 years ago

This needs to be usable by a linguist. Talk to Alexis, Steven, Alex, Hywel and Rolando about this.
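To make the problem in the title concrete, here is a minimal sketch of greedy left-to-right (longest-match) segmentation next to an exhaustive enumeration of every valid segmentation. The function names and the toy grapheme inventory are assumptions made up for illustration. They are not Persephone's API and are not taken from any real orthography.

```python
from typing import List, Set


def greedy_segment(text: str, inventory: Set[str]) -> List[str]:
    """Segment left to right, always taking the longest matching unit."""
    max_len = max(len(unit) for unit in inventory)
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in inventory:
                segments.append(text[i:i + n])
                i += n
                break
        else:
            raise ValueError(f"no unit matches at position {i}: {text[i:]!r}")
    return segments


def all_segmentations(text: str, inventory: Set[str]) -> List[List[str]]:
    """Enumerate every way to split text into units from the inventory."""
    if not text:
        return [[]]
    results = []
    for unit in sorted(inventory):  # sorted for a deterministic order
        if text.startswith(unit):
            for rest in all_segmentations(text[len(unit):], inventory):
                results.append([unit] + rest)
    return results


# Hypothetical inventory in which "ng" is a digraph but "n" and "g"
# also occur as separate units.
toy_inventory = {"a", "i", "n", "g", "ng"}

greedy = greedy_segment("anga", toy_inventory)
candidates = all_segmentations("anga", toy_inventory)
```

With this toy inventory, the greedy pass commits to `a ng a`, while the enumeration also finds `a n g a`. Nothing in the string itself says which is right, so a linguist would need some way to state the choice, for example through context rules or explicit boundary marks.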

shuttle1987 commented 6 years ago

Is this the type of thing LabelSegmenter is supposed to deal with?

oadams commented 6 years ago

Yeah, that's right.

alexis-michaud commented 6 years ago

I have never really dealt with the correspondence between orthography and pronunciation (graphophonemics), because I mostly work on languages without a written tradition. It seems clear that a full-fledged language model is necessary for disambiguation: English 'lead' could be the metal or the verb 'to lead', each with a different pronunciation. I'm told that tools such as NooJ handle this very well (in rule-based mode).

An orthography for Na is currently under development (well advanced: there is a working 'beta' version). Even though it is designed for the modern language (i.e. without the complexities that arise when a language keeps evolving after its orthography was created, as with English), conversion to phonemic form would be highly problematic: the orthography indicates very little information about tone, and it makes dialect compromises to facilitate use by speakers from various areas. My intuition is that using orthographic input to Persephone would detract considerably from the quality of the model & output, as compared with phonemic input.

alexis-michaud commented 6 years ago

@nikopartanen noted that:

"Cyrillic writing system in itself adds lots of redundancy and the grapheme-phoneme ratio is not ideal, so I assume that for Persephone more exact phoneme level could be more suitable"

The tests that he carried out with Persephone on Komi (a Uralic language) would seem to support his intuition that orthographic input won't do.