Closed MrBrezina closed 3 years ago
One of the issues with the current approach (or the proposed approach) is that the presence of a combining mark does not guarantee that graphemes of an orthography lacking precomposed character forms are supported. The data model does not store which characters the combining marks are meant to combine with, which prevents testing them in hyperglot or in tools using hyperglot.
For example, Guarani uses ã ẽ g̃ ĩ õ ũ ỹ, where all but g̃ have precomposed characters encoded. A font that has a combining tilde does not necessarily support Guarani properly unless it positions the combining tilde on g or substitutes the sequence with a single glyph.
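One way to see which of these sequences have a precomposed form is Unicode NFC normalization, e.g. with Python's standard `unicodedata` module. NFC folds a base + mark sequence into a single character only when Unicode encodes a precomposed form:

```python
import unicodedata

# Guarani base letters combined with U+0303 COMBINING TILDE.
# NFC yields one code point only where a precomposed character exists;
# g is the only base here that stays decomposed.
for base in "aegiouy":
    seq = base + "\u0303"
    nfc = unicodedata.normalize("NFC", seq)
    print(f"{seq}  precomposed: {len(nfc) == 1}")
```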
A first step toward resolving this would be to store these graphemes in `base` or `auxiliary`. Then a check that verifies these sequences are modified, either by positioning or by substitution, by the default features in the relevant language system could confirm that the font has some support for those graphemes. That said, there may be cases where positioning or substitution are not required for support.
The quality of support is probably out of scope, the same way it is for simple graphemes: positioning that is applied but still incorrect is no different from wrong positioning in a precomposed character. This would have to be assessed differently.
Originally, we wanted to include the combinations and I wanted to check for the presence of relevant OpenType features and lookups. I think we only postponed the combinations for later. I am definitely for their inclusion in `base` and `auxiliary`, or possibly in a separate `combinations` entry, or even a separate database if it gets too large (Indian languages). What do you think @kontur?
Regarding OT feature checks, I am not convinced that it makes sense to do an elaborate check that would not completely tackle the issue (be it OT feature analysis or ML-driven OCR on test documents). The ultimate check has to be visual and according to the objectives of the designer and the purpose of what they are trying to do.
I think I drew the line around design differently from you and included the quality of OpenType features in it. Thus, my current inclination is to trust (!) the users and provide general (format agnostic :) ) notes and perhaps point them to design guidelines where these are available.
[edited for clarity]
I agree, OT feature check is a difficult one. Knowing which combinations occur is the minimum that's currently missing.
Yes, I think this is one of the edge cases where conceptually the decomposition approach does not work well. Even more so if the unencoded glyph, let's say /gtilde/, is in a charset in which no other encoded glyphs that decompose to base + /tildecomb/ are present. I think we could simply tolerate such unencoded base + mark combinations in the data, but it gets a little tricky with the parsing and saving of that data. Normally we split everything into unicodes that can be split, whereas here we would somehow have to keep those combinations joined.
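A minimal sketch of that parsing idea (a hypothetical helper, not hyperglot's actual implementation): store an item as a single NFC code point where a precomposed form exists, but keep unencoded base + mark sequences joined:

```python
import unicodedata

def parse_characters(char_list):
    """Split a space-separated character list, storing each item as a
    single precomposed (NFC) code point when one exists, and keeping
    unencoded base + mark combinations (like g̃) joined.
    Hypothetical helper for illustration only."""
    items = []
    for token in char_list.split():
        nfc = unicodedata.normalize("NFC", token)
        # len(nfc) == 1 means Unicode encodes a precomposed character
        items.append(nfc if len(nfc) == 1 else token)
    return items

# 'ã' is stored precomposed; 'g̃' stays as base + U+0303
parse_characters("a\u0303 g\u0303")
```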
I think a better implementation for support checking would indeed be to check the mark features and make sure the glyphs involved are listed. This would be a solution that sidesteps design considerations, but also catches false positives of the current approach, where the involved glyphs just happen to be in the font but are not linked via the mark feature at all.
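As a rough sketch of such a check, assuming fontTools: the `gpos` argument below would be `TTFont(path)["GPOS"].table`, and a complete check would additionally need to walk the MarkBasePos coverage tables behind the lookups to confirm the specific base and mark glyphs are linked:

```python
def mark_feature_lookups(gpos):
    """Return the lookup indices referenced by the GPOS 'mark' feature.
    Only a first step: confirming that particular base and mark glyphs
    are actually linked requires inspecting the MarkBasePos subtables
    behind these lookups as well. Sketch, not hyperglot code."""
    lookups = set()
    for record in gpos.FeatureList.FeatureRecord:
        if record.FeatureTag == "mark":
            lookups.update(record.Feature.LookupListIndex)
    return lookups
```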
For now a good step will be to split into `base`, `base_marks`, `auxiliary` and `auxiliary_marks`, and an implementation to test for those in the CLI.
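A sketch of what such a check could look like with those lists (the four field names are as proposed; everything else here is hypothetical, not hyperglot's implementation):

```python
def supports_base(font_codepoints, orthography):
    """Minimal support check: every code point required by `base` and
    `base_marks` must be present in the font's cmap. Hypothetical
    sketch; `orthography` is a dict with the proposed keys, and items
    may be unencoded base + mark combinations."""
    required = set()
    for key in ("base", "base_marks"):
        for item in orthography.get(key, []):
            required.update(ord(cp) for cp in item)
    return required <= font_codepoints

# A font with 'g' and combining tilde passes for an orthography that
# stores the unencoded combination g + U+0303:
orthography = {"base": ["g\u0303"], "base_marks": ["\u0303"]}
print(supports_base({ord("g"), 0x0303}, orthography))  # True
```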
Just a note that this is also relevant for Kildin Sami (sjd), where some of the following letters of the alphabet are precomposed, but not all of them. (The list is not complete, but just for illustration.)
А̄ Е̄ Ӣ Э̄ Я̄ Ӯ Ё̄
For the record in this discussion: we've just released 0.3.0, which addresses these issues with marks in a different way. Characters in a list are no longer required to be encoded unicode characters; unencoded base + mark combinations can also be stored. Marks will still be extracted from `base` and `auxiliary` and saved in `marks`, but `marks` can further list other marks the language requires which are not part of the character lists. The CLI now requires the `--marks` option to explicitly check for the presence of combining marks; `--decompose` works as before. More details are in the changelog.
We've done some review of orthographies where the data suggested some unencoded combinations had previously been dropped, but @meehkal and @moyogo, if you have languages/orthographies in mind where this was an issue, feel free to point them out to us or submit a fix. I hope that the character list storage, now less unicode-centric and closer to how linguists think about orthographies, eases the readability and input of language data.
The intention is to include combining marks that are used in canonical decomposition defined by Unicode for the characters used by an orthography. The current plan (not implemented yet) would work like this:
- `base` contains `š`: this will imply inclusion of `◌̌` (combining caron) in `base_marks`
- `auxiliary` contains `á`: this will imply inclusion of `◌́` (combining acute) in `auxiliary_marks`
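This plan maps directly onto Unicode canonical decomposition (NFD); a sketch in Python, not hyperglot's implementation:

```python
import unicodedata

def derive_marks(characters):
    """Collect the combining marks implied by a character list via
    canonical (NFD) decomposition, as in the plan above. Sketch only."""
    marks = set()
    for char in characters:
        for cp in unicodedata.normalize("NFD", char):
            if unicodedata.combining(cp):
                marks.add(cp)
    return marks

base_marks = derive_marks(["š"])       # combining caron, U+030C
auxiliary_marks = derive_marks(["á"])  # combining acute, U+0301
```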
The reason for this is that a font which does not cover the code points for combining marks might not be recognized as supporting a language, even though it could be used for that language without any problem thanks to its precomposed characters. We are not able to evaluate the quality of the mark positioning in a font.
Proposition: in situations where all combinations in `base` are covered without marks, with precomposed characters only, we could include the combining marks in `auxiliary_marks` only, rather than in `base_marks`. The detection would check, by default, only `base` and `base_marks`. This way, languages would still be detected (without any flag or toggle) and the marks would still be noted. We think that the inclusion of combining marks in fonts is: a) a more future-proof and better design strategy, b) useful for technical reasons when precomposed characters get decomposed automatically in some situations.
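The proposed rule can be sketched as follows (hypothetical helper; `base_items` would be the orthography's `base` list, possibly containing unencoded combinations):

```python
import unicodedata

def marks_destination(base_items):
    """Where should the derived combining marks be stored under the
    proposition above? If every base item collapses to a single
    precomposed code point under NFC, the marks go to auxiliary_marks;
    otherwise they belong in base_marks. Sketch only, not hyperglot's
    implementation."""
    precomposed_only = all(
        len(unicodedata.normalize("NFC", item)) == 1 for item in base_items
    )
    return "auxiliary_marks" if precomposed_only else "base_marks"

# All-precomposed base (e.g. ã ẽ): marks only need noting in auxiliary_marks.
# A base with an unencoded combination (g + U+0303) keeps them in base_marks.
print(marks_destination(["a\u0303", "e\u0303"]))             # auxiliary_marks
print(marks_destination(["a\u0303", "g\u0303"]))             # base_marks
```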