In our system, an Inuktitut word is made up of SyllabicUnits. Each unit is associated with the set of dialects it can occur in: for example, "ai" is seen in all dialects, but "ᕠ" is only seen in Nattilik. As of writing, every SyllabicUnit is associated with either all dialects or a single dialect, though in the future some units may be associated with only, say, two given dialects.
From this, we should be able to determine the potential dialect set of a word. I think it should be implemented like this: the dialect sets of all SyllabicUnits in a word are intersected, and the result is that word's dialect set. (A union would not work here: combining an all-dialects unit with a Nattilik-only unit would give back all dialects.) The word "inuit", for example, would still have a dialect set containing all dialects, whereas "ᐃᓅ𑪰ᒻᓂᒃ" would be Nattilik, due to the "𑪰" character (from [1]).
In the case where two sets are incompatible, as in "𑪰ᖅᒃ", which contains two SyllabicUnits, one that only occurs in Nattilik and one that only occurs in Nunavut, no dialect is consistent with every unit, and I think it's safe to mark the word as "unknown". Such words could be shown in a bright colour in the web extension's (future) debug mode, to help refine the SyllabicUnit classifications.
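As a rough sketch of the combining step, assuming a hand-rolled bitmask set rather than any particular enum-set crate: the Dialect variants below are placeholders (only Nattilik and Nunavut are named in these notes), and the fields of SyllabicUnit and the function word_dialects are hypothetical names for illustration, not the project's actual definitions.

```rust
/// Placeholder dialect list; the real set of dialects is an assumption here.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[allow(dead_code)]
enum Dialect {
    Nattilik = 0b001,
    Nunavut = 0b010,
    Nunavik = 0b100,
}

/// A tiny bitmask-backed set of dialects (stand-in for an enum-set type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct DialectSet(u8);

#[allow(dead_code)]
impl DialectSet {
    /// Every dialect in the placeholder list above.
    const ALL: DialectSet = DialectSet(0b111);

    /// A set containing exactly one dialect.
    fn only(d: Dialect) -> DialectSet {
        DialectSet(d as u8)
    }

    /// Dialects present in both sets.
    fn intersect(self, other: DialectSet) -> DialectSet {
        DialectSet(self.0 & other.0)
    }

    /// True when no dialect is left, i.e. the units were incompatible.
    fn is_empty(self) -> bool {
        self.0 == 0
    }
}

/// Hypothetical shape of a syllabic unit: its text plus the dialects it can occur in.
#[derive(Clone, Copy, Debug)]
#[allow(dead_code)]
struct SyllabicUnit {
    text: &'static str,
    dialects: DialectSet,
}

/// Intersect the dialect sets of every unit, starting from "all dialects".
/// An empty result means no single dialect is consistent with all units.
#[allow(dead_code)]
fn word_dialects(units: &[SyllabicUnit]) -> DialectSet {
    units
        .iter()
        .fold(DialectSet::ALL, |acc, unit| acc.intersect(unit.dialects))
}
```

With this shape, a word made only of all-dialect units keeps DialectSet::ALL, one Nattilik-only unit narrows the word to Nattilik, and a Nattilik-only unit next to a Nunavut-only unit leaves an empty set, which is the "unknown" case.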
How then to encode this set of possibilities? I think this is still logically a set, with "unknown" as the one exception. In Rust, this should probably be represented as an optional set (Option<enum_set>), with None standing for "unknown". If we're going to chunk the Inuktitut text into larger logical units in the DOM, I may have to figure out a way to pass the InuktitutWord and its optional set up to JS, though this shouldn't be too difficult and is out of scope for the moment.
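One way the optional-set encoding could look, as a sketch only: DialectSet here is a hypothetical bitmask stand-in for whatever enum-set type ends up being used, the three-dialect layout is a placeholder, and word_dialects and debug_label are illustrative names. None stands for "unknown".

```rust
/// Bitmask stand-in for an enum-set type; the bit layout is a placeholder.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct DialectSet(u8);

#[allow(dead_code)]
impl DialectSet {
    const NATTILIK: DialectSet = DialectSet(0b001);
    const NUNAVUT: DialectSet = DialectSet(0b010);
    const ALL: DialectSet = DialectSet(0b111);
}

/// Narrow the candidate dialects unit by unit.
/// Returns None ("unknown") as soon as no dialect fits every unit seen so far.
#[allow(dead_code)]
fn word_dialects(unit_sets: &[DialectSet]) -> Option<DialectSet> {
    unit_sets.iter().try_fold(DialectSet::ALL, |acc, set| {
        let narrowed = DialectSet(acc.0 & set.0);
        if narrowed.0 == 0 {
            None
        } else {
            Some(narrowed)
        }
    })
}

/// Coarse label a debug view might show for a word's dialect result.
#[allow(dead_code)]
fn debug_label(dialects: Option<DialectSet>) -> &'static str {
    match dialects {
        None => "unknown",
        Some(DialectSet::ALL) => "all dialects",
        Some(_) => "restricted",
    }
}
```

Using Option keeps the empty set out of the type's valid values, so callers have to handle "unknown" explicitly. Note that in this sketch a word with no units at all comes back as Some(ALL); whether that is the desired behaviour is an open choice.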