Clarify how Hyperglot handles composed and decomposed combinations, e.g. accented letters

rosettatype / hyperglot

Hyperglot: a database and tools for detecting language support in fonts

http://hyperglot.rosettatype.com

GNU General Public License v3.0


Clarify how Hyperglot handles composed and decomposed combinations, e.g. accented letters #149

Open MrBrezina opened 7 months ago

MrBrezina commented 7 months ago

@moyogo made the following comment in #147:

maybe hyperglot should automatically have a general note for when graphemes can be both composed or decomposed, instead of adding such a note to every single language where this does or can occur.

I think it makes sense to clarify this in the README, web app, and maybe even in CLI. But I wonder if it would be best as a point in some kind of language support checklist.

The handling could differ for each language: supporting a combining dieresis is optional for German, while for Tlingit both <Ḵ> and <ḵ> should be supported both as precomposed characters and as sequences using combining marks (see #147). Perhaps this could be a flag for orthographies, @kontur?

moyogo commented 6 months ago

Note that for German, while it is not common, one may still get decomposed characters when dealing with files on the macOS file system, as strings are stored in something close to NFD rather than NFC. Usually the UI file dialog or Finder normalizes strings when names are copied, but the name can still be in decomposed form if obtained otherwise, for example in some applications or when files are copied to some other operating systems.

One can also easily stumble upon decomposed forms in library catalogues. For example in the LOC catalogue or the NYPL catalogue.

The nature of the Unicode model means these decomposed forms should also be supported in German, even if they are less common in the large corpus of German data.
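To make the NFC/NFD distinction concrete, here is a minimal sketch using only Python's standard `unicodedata` module. It is not Hyperglot code; the helper name `codepoints` is made up for illustration.

```python
import unicodedata

# "ü" as a single precomposed code point (the NFC form)
composed = "\u00fc"
# The same grapheme in NFD: base letter "u" followed by U+0308 COMBINING DIAERESIS
decomposed = unicodedata.normalize("NFD", composed)


def codepoints(text):
    """List the code points of a string in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(codepoints(composed))    # ['U+00FC']
print(codepoints(decomposed))  # ['U+0075', 'U+0308']

# Both strings render as "ü", but they only compare equal after normalization
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

A font that only maps U+00FC has no glyph for U+0308, so it cannot display the second string correctly.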

kontur commented 6 months ago

I think there is a difference between being able to represent an orthography accurately and being able to represent all legacy input sequences of an orthography (such as those encountered in digital texts, which is what the PR comment was concerned with). We have the --marks and --decomposed flags for those nuances, should someone want to check a font against them specifically. I don't see this as something that inherently needs differentiating in the language data. Quite the opposite: all data in Hyperglot is saved in composed form where such a form exists, which makes it possible to distinguish marks that are merely part of precomposed characters from combinations that have no composed form and thus explicitly require the base + mark.
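The point about storing data in composed form can be sketched as follows. This is an illustration with the standard library, not Hyperglot's actual implementation; the function name `explicitly_required_marks` is hypothetical. Any combining mark that survives NFC normalization has no precomposed form with its base, so it must be required explicitly.

```python
import unicodedata


def explicitly_required_marks(graphemes):
    """Return the combining marks that remain after NFC normalization.

    Hypothetical helper: a mark that survives NFC has no precomposed
    form with its base, so base + mark is the only way to encode it.
    """
    marks = set()
    for grapheme in graphemes:
        for ch in unicodedata.normalize("NFC", grapheme):
            if unicodedata.combining(ch):  # non-zero combining class
                marks.add(ch)
    return marks


# K + macron below composes to U+1E34 and u + diaeresis to U+00FC,
# but x + macron below has no precomposed code point.
sample = ["K\u0331", "u\u0308", "x\u0331"]
print(explicitly_required_marks(sample) == {"\u0331"})  # True
```

Only U+0331 is reported, and only because of the x sequence; the marks inside Ḵ and ü are absorbed into their precomposed code points.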

If this were an issue, it would apply to all orthographies and all characters that have Unicode decompositions.

MrBrezina commented 6 months ago

@moyogo I agree with that, but what you describe sounds like a recommended best practice, not a minimal requirement for language support (good-enough practice). We do not want detection to fail for a font that is good enough. Minimality is a key notion in Hyperglot. (I would have loved to call it a principle, but I am unable to support it with a clear definition.)

Frankly, I forgot about the global switch for the CLI when I wrote the issue, but the issue still stands. For some languages, supporting a decomposed solution may be an essential feature; for others it seems non-essential. In theory at least. In order to add a note in the README and elsewhere, I would like to clarify our position.

Sorry for the latency in my replies.

kontur commented 6 months ago

From the README:

-m, --marks: Flag to signal a font should also include all combining marks used for a language - by default only those marks are required which are not part of preencoded characters (default is False)
-d, --decomposed: Flag to signal a font should be considered supporting a language as long as it has all base glyphs and marks to write a language - by default also encoded precomposed glyphs are required (default is False)

I think changing the --decomposed default to True would give the broadest results. To give a rough idea, Rosetta's Adapter PE supports 398 languages in default detection and 418 with --decomposed. I think the original argument, which is still valid, is that the mere presence of the combining characters is not enough to be certain the composites work, e.g. that they have mark attachment points. If we were to check base + mark combinations for anchors, or check that actual shaping happens, it would then make sense to change the default.
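As a rough model of how the two flags change the outcome, here is a minimal sketch based only on the README wording quoted above. It is not Hyperglot's code: the function names, the plain set of characters standing in for a font's character map, and the assumption that --decomposed accepts either the precomposed character or its base + marks are all mine.

```python
import unicodedata


def char_supported(char, font_chars, decomposed=False):
    """Check one orthography character against a set of encoded characters."""
    # Default: the composed (NFC) form must be encoded. Sequences without a
    # precomposed form stay base + mark in NFC, so their marks are required.
    if all(ch in font_chars for ch in unicodedata.normalize("NFC", char)):
        return True
    # Assumed --decomposed behaviour: base glyphs plus marks are enough.
    if decomposed:
        return all(ch in font_chars for ch in unicodedata.normalize("NFD", char))
    return False


def language_supported(characters, font_chars, decomposed=False, marks=False):
    """Hypothetical stand-in for a language support check with both flags."""
    font_chars = set(font_chars)
    if not all(char_supported(c, font_chars, decomposed) for c in characters):
        return False
    if marks:
        # Assumed --marks behaviour: every combining mark used must be encoded.
        used = {
            ch
            for c in characters
            for ch in unicodedata.normalize("NFD", c)
            if unicodedata.combining(ch)
        }
        return used <= font_chars
    return True


language = ["a", "\u00fc"]  # toy orthography: a, ü
marks_only_font = {"a", "u", "\u0308"}
precomposed_font = {"a", "\u00fc"}

print(language_supported(language, marks_only_font))                    # False
print(language_supported(language, marks_only_font, decomposed=True))  # True
print(language_supported(language, precomposed_font))                   # True
print(language_supported(language, precomposed_font, marks=True))       # False
```

Like the current detection described above, this only tests for the presence of characters. It says nothing about mark attachment points or shaping.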

kontur commented 5 months ago

I consider this clarified :)

MrBrezina commented 5 months ago

This still needs to be clarified in the web app.

kontur commented 1 week ago

Idea: For the CLI, output a short preamble before the test result that clarifies marks, decomposition and shaping checks, as well as opt-in flags, and how they affect the result.