Open NeilSureshPatel opened 2 years ago
An overlapping issue related by Dan Burzo:
I was looking at Romanian, wondering about the necessity of combining marks as independent codepoints to declare it supported. As long as “Ă” and “ă” exist, a combining breve is just a nice-to-have (maybe to form the historical ĕ, ĭ, ŭ)?
Our marks entry is underspecified: does it mean the marks you need to form characters, or marks which can attach to a variety of bases? For Romanian, it's the former: we ask for '◌̂', '◌̆', '◌̦', '◌̧' but only because we have base characters which already contain those marks. And so this field is redundant data: just decompose the base characters into NFD, and there your marks are. But for Arabic it's the latter: we ask for '◌ٰ', '◌ٓ', '◌ٔ', '◌ٕ', '◌ً', '◌ٌ', '◌ٍ', '◌َ', '◌ُ', '◌ِ', '◌ّ', '◌ْ' which can sit on top of any base consonant. This is new data since it can't be derived from the bases.
I think we probably want to move towards the latter interpretation: "marks" are any independent combining marks that you need to support the language.
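The difference between the two interpretations can be sketched with Python's standard unicodedata module. The helper name marks_from_bases and the short character lists are illustrative assumptions, not data taken from gflang: precomposed Romanian letters yield their marks under NFD, while plain Arabic consonants yield none, so the Arabic marks cannot be derived from the bases.

```python
import unicodedata


def marks_from_bases(bases):
    """Collect the combining marks found by NFD-decomposing each base.

    Hypothetical helper for illustration; not part of gflang or shaperglot.
    """
    marks = set()
    for base in bases:
        for ch in unicodedata.normalize("NFD", base):
            # General categories Mn, Mc and Me are the combining marks.
            if unicodedata.category(ch).startswith("M"):
                marks.add(ch)
    return marks


# Illustrative sample: ă, â, î, ș, ț (precomposed Romanian letters).
romanian_bases = ["\u0103", "\u00e2", "\u00ee", "\u0219", "\u021b"]
# Illustrative sample: beh and teh (plain Arabic consonants).
arabic_bases = ["\u0628", "\u062a"]

romanian_marks = marks_from_bases(romanian_bases)  # breve, circumflex, comma below
arabic_marks = marks_from_bases(arabic_bases)      # empty: nothing to decompose
```

Under the second interpretation, only marks that a check like this cannot recover from the bases would need to be listed explicitly.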
As we prepare to implement shaperglot for testing African language support, I am noticing that there is variation in the way language orthographies are incorporated in gflang. Here are a few examples:
bas_Latn
bin_Latn
af_Latn
The first inconsistency is that not all language profiles contain auxiliary bases when they should. When auxiliary bases include a mark, the marks list doesn't always include that mark.
The second big inconsistency is the inclusion of non-precomposed base/mark pairs in the base list. Sometimes these pairs are in the base list and sometimes they are not.
For shaperglot to parse gflang properly and run its orthography tests, we need some consistency in how the exemplar character lists are constructed. For the purposes of shaperglot, it is good to have gflang contain all necessary base/mark pairs, regardless of whether they can be precomposed. The variation appears to be caused by the incoming source data. (The bas_Latn entry reflects the data in CLDR, including the lack of spaces between certain bases.) Should we have a guideline that specifically spells out what needs to be included in bases, auxiliary, and marks?
Perhaps something like:
- bases: all primary characters of a language, including precomposed base/mark pairs and non-composed base/mark pairs when a precomposed character is not encoded
- auxiliary: all secondary characters of a language, including precomposed base/mark pairs and non-composed base/mark pairs when a precomposed character is not encoded
- marks: all standalone marks, whether they are primary or auxiliary
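A guideline along these lines could be checked mechanically. The following is a minimal sketch, assuming each profile can be reduced to three plain lists of strings. The function name missing_marks and the sample data are hypothetical and do not reflect the actual gflang schema or any real language profile. It flags marks used in base/mark pairs that have no precomposed form (they survive NFC) but are absent from the marks list.

```python
import unicodedata


def combining_marks(text):
    """Return the set of combining marks (categories Mn, Mc, Me) in text."""
    return {ch for ch in text if unicodedata.category(ch).startswith("M")}


def missing_marks(bases, auxiliary, marks):
    """Marks needed by non-precomposable pairs but not declared in marks.

    Hypothetical consistency check; assumes marks entries may carry a
    dotted-circle placeholder (U+25CC), which is ignored because it is
    not itself a combining mark.
    """
    declared = set()
    for entry in marks:
        declared |= combining_marks(entry)

    needed = set()
    for entry in list(bases) + list(auxiliary):
        # After NFC, any mark still present has no precomposed form
        # with its base, so it must exist as a standalone codepoint.
        needed |= combining_marks(unicodedata.normalize("NFC", entry))
    return needed - declared


# Made-up sample: "é" composes under NFC, "ɛ́" does not.
sample_bases = ["e\u0301", "\u025b\u0301"]
sample_marks = ["\u25cc\u0302"]  # only a circumflex is declared
gaps = missing_marks(sample_bases, [], sample_marks)  # the combining acute
```

A stricter variant could decompose with NFD instead of composing with NFC, which would require every mark to be listed even when a precomposed character exists. Which of the two is right depends on how the marks field ends up being defined.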