Closed MrBrezina closed 3 years ago
One of the issues with the current approach (or the proposed approach) is that the presence of a combining mark does not guarantee that graphemes of an orthography lacking precomposed character forms are supported. The data model does not store which characters the combining marks are meant to combine with, which prevents testing them in hyperglot or in tools using hyperglot.
For example, Guarani uses ã ẽ g̃ ĩ õ ũ ỹ, where all but g̃ have precomposed characters encoded. A font that has a combining tilde does not necessarily support Guarani properly unless it positions the combining tilde on g or substitutes the sequence with a single glyph.
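One way to see which of these sequences have a precomposed form is Unicode NFC normalization, e.g. with Python's standard `unicodedata` module. NFC folds a base + mark sequence into a single character only when Unicode encodes a precomposed form:

```python
import unicodedata

# Guarani base letters combined with U+0303 COMBINING TILDE.
# NFC yields one code point only where a precomposed character exists;
# g is the only base here that stays decomposed.
for base in "aegiouy":
    seq = base + "\u0303"
    nfc = unicodedata.normalize("NFC", seq)
    print(f"{seq}  precomposed: {len(nfc) == 1}")
```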
A first step toward resolving this would be to store these graphemes in `base` or `auxiliary`. Then a check that verifies these sequences are modified, either by positioning or by substitution, by the default features in the relevant language system could confirm that the font has some support for those graphemes. That said, there may be cases where positioning or substitution are not required for support.
The quality of support is probably out of scope, the same way it is for simple graphemes: positioning that is applied but still incorrect is no different from wrong positioning in a precomposed character. This would have to be assessed differently.
Originally, we wanted to include the combinations and I wanted to check for the presence of relevant OpenType features and lookups. I think we only postponed the combinations for later. I am definitely for their inclusion in `base` and `auxiliary`, or possibly in a separate `combinations` entry, or even a separate database if it gets too large (Indian languages). What do you think @kontur?
Regarding OT feature checks, I am not convinced that it makes sense to do an elaborate check that would not completely tackle the issue (be it OT feature analysis or ML-driven OCR on test documents). The ultimate check has to be visual and according to the objectives of the designer and the purpose of what they are trying to do.
I think I drew the line around design differently from you and included the quality of OpenType features in it. Thus, my current inclination is to trust (!) the users and provide general (format agnostic :) ) notes and perhaps point them to design guidelines where these are available.
[edited for clarity]
I agree, OT feature check is a difficult one. Knowing which combinations occur is the minimum that's currently missing.
Yes, I think this is one of the edge cases where conceptually the decomposition approach does not work well. Even more so if the unencoded glyph, let's say /gtilde/, is in a charset in which no other encoded glyphs that decompose to base + /tildecomb/ are present. I think we could simply tolerate such unencoded base + mark combinations in the data, but it gets a little tricky with the parsing and saving of that data. Normally we split everything into unicodes that can be split, whereas here we would somehow have to keep those combinations joined.
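A minimal sketch of that parsing idea (a hypothetical helper, not hyperglot's actual implementation): store an item as a single NFC code point where a precomposed form exists, but keep unencoded base + mark sequences joined:

```python
import unicodedata

def parse_characters(char_list):
    """Split a space-separated character list, storing each item as a
    single precomposed (NFC) code point when one exists, and keeping
    unencoded base + mark combinations (like g̃) joined.
    Hypothetical helper for illustration only."""
    items = []
    for token in char_list.split():
        nfc = unicodedata.normalize("NFC", token)
        # len(nfc) == 1 means Unicode encodes a precomposed character
        items.append(nfc if len(nfc) == 1 else token)
    return items

# 'ã' is stored precomposed; 'g̃' stays as base + U+0303
parse_characters("a\u0303 g\u0303")
```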
I think a better implementation for support checking would indeed be to check the mark features and make sure the glyphs involved are listed. This would be a solution that sidesteps design considerations, but also catches false positives of the current approach, where the involved glyphs just happen to be in the font but are not linked via the mark feature at all.
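As a rough sketch of such a check, assuming fontTools: the `gpos` argument below would be `TTFont(path)["GPOS"].table`, and a complete check would additionally need to walk the MarkBasePos coverage tables behind the lookups to confirm the specific base and mark glyphs are linked:

```python
def mark_feature_lookups(gpos):
    """Return the lookup indices referenced by the GPOS 'mark' feature.
    Only a first step: confirming that particular base and mark glyphs
    are actually linked requires inspecting the MarkBasePos subtables
    behind these lookups as well. Sketch, not hyperglot code."""
    lookups = set()
    for record in gpos.FeatureList.FeatureRecord:
        if record.FeatureTag == "mark":
            lookups.update(record.Feature.LookupListIndex)
    return lookups
```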
For now a good step will be to split into `base`, `base_marks`, `auxiliary` and `auxiliary_marks`, and an implementation to test for those in the CLI.
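A sketch of what such a check could look like with those lists (the four field names are as proposed; everything else here is hypothetical, not hyperglot's implementation):

```python
def supports_base(font_codepoints, orthography):
    """Minimal support check: every code point required by `base` and
    `base_marks` must be present in the font's cmap. Hypothetical
    sketch; `orthography` is a dict with the proposed keys, and items
    may be unencoded base + mark combinations."""
    required = set()
    for key in ("base", "base_marks"):
        for item in orthography.get(key, []):
            required.update(ord(cp) for cp in item)
    return required <= font_codepoints

# A font with 'g' and combining tilde passes for an orthography that
# stores the unencoded combination g + U+0303:
orthography = {"base": ["g\u0303"], "base_marks": ["\u0303"]}
print(supports_base({ord("g"), 0x0303}, orthography))  # True
```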
Just a note that this is also relevant for Kildin Sami (sjd), where some of the following letters of the alphabet are precomposed, but not all of them. (The list is not complete, but just for illustration.)
А̄ Е̄ Ӣ Э̄ Я̄ Ӯ Ё̄
For the record in this discussion: we've just released 0.3.0, which addresses these issues with marks in a different way. Characters in a list are no longer required to be encoded unicode characters; unencoded base + mark combinations can also be stored. Marks will still be extracted from `base` and `auxiliary` and saved in `marks`, but `marks` can further list other marks the language requires which are not part of the character lists. The CLI now requires the `--marks` option to explicitly check for the presence of combining marks; `--decompose` works as before. More details are in the changelog.
We've done some review of orthographies where the data suggested some unencoded combinations had previously been dropped, but @meehkal and @moyogo, if you have languages/orthographies in mind where this was an issue, feel free to point them out to us or submit a fix. I hope that the character list storage, now less unicode-centric and closer to how linguists think about orthographies, eases the readability and input of language data.
The intention is to include combining marks that are used in canonical decomposition defined by Unicode for the characters used by an orthography. The current plan (not implemented yet) would work like this:
- `base` contains `š`: this will imply inclusion of `◌̌` (combining caron) in `base_marks`
- `auxiliary` contains `á`: this will imply inclusion of `◌́` (combining acute) in `auxiliary_marks`
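This plan maps directly onto Unicode canonical decomposition (NFD); a sketch in Python, not hyperglot's implementation:

```python
import unicodedata

def derive_marks(characters):
    """Collect the combining marks implied by a character list via
    canonical (NFD) decomposition, as in the plan above. Sketch only."""
    marks = set()
    for char in characters:
        for cp in unicodedata.normalize("NFD", char):
            if unicodedata.combining(cp):
                marks.add(cp)
    return marks

base_marks = derive_marks(["š"])       # combining caron, U+030C
auxiliary_marks = derive_marks(["á"])  # combining acute, U+0301
```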
The reason for this is that a font which does not cover the code points for combining marks might not be recognized as supporting a language, even though it could be used for that language without any problem thanks to its precomposed characters. We are not able to evaluate the quality of the mark positioning in a font.
Proposition: in situations where all combinations in `base` are covered without marks, with precomposed characters only, we could include the combining marks in `auxiliary_marks` only, rather than in `base_marks`. The detection would check, by default, only `base` and `base_marks`. This way, languages would still be detected (without any flag or toggle) and the marks would still be noted. We think that the inclusion of combining marks in fonts is: a) a more future-proof and better design strategy, b) useful for technical reasons when precomposed characters get decomposed automatically in some situations.
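The proposed rule can be sketched as follows (hypothetical helper; `base_items` would be the orthography's `base` list, possibly containing unencoded combinations):

```python
import unicodedata

def marks_destination(base_items):
    """Where should the derived combining marks be stored under the
    proposition above? If every base item collapses to a single
    precomposed code point under NFC, the marks go to auxiliary_marks;
    otherwise they belong in base_marks. Sketch only, not hyperglot's
    implementation."""
    precomposed_only = all(
        len(unicodedata.normalize("NFC", item)) == 1 for item in base_items
    )
    return "auxiliary_marks" if precomposed_only else "base_marks"

# All-precomposed base (e.g. ã ẽ): marks only need noting in auxiliary_marks.
# A base with an unencoded combination (g + U+0303) keeps them in base_marks.
print(marks_destination(["a\u0303", "e\u0303"]))             # auxiliary_marks
print(marks_destination(["a\u0303", "g\u0303"]))             # base_marks
```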