googlefonts / shaperglot

Test font files for language support
Apache License 2.0

Initial README content #3

Closed by twardoch 1 year ago

twardoch commented 2 years ago

Problem

When you choose a font to typeset some text, the very first question that interests you is: which fonts support the language(s) of my text? A font that doesn’t support the languages won’t be of any interest.

But what does it mean, exactly, that a font supports a given language? For Latin-script fonts, the task is reasonably easy and mostly amounts to asking: does the font have glyphs for all the Unicode codepoints used by the language? In reality, even this isn’t always trivial. To typeset text that is written in English, it’s not enough that the font has glyphs for the A-Z and a-z letters. It also needs digits, and some punctuation. Well, it also probably needs some accented letters, because you may want to write the names Chloë or Brontë, for example.

But it’s still a relatively easy task to check. The Unicode CLDR project collects “exemplar characters” in several categories. If you check that the font contains glyphs for all these characters, you can say, “OK, this font supports this language”. The Rosetta Type Hyperglot project contains similar information, with some annotations.
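To make the coverage check concrete, here is a minimal sketch in Python. The function name `missing_codepoints` and the sample data are made up for illustration. In practice the set of codepoints would come from the font's character map (for example the keys of fontTools' `TTFont.getBestCmap()`), and the exemplar string would come from CLDR or Hyperglot.

```python
def missing_codepoints(font_codepoints, exemplar_chars):
    """Return the exemplar characters the font has no glyph for, sorted."""
    return sorted(
        ch for ch in set(exemplar_chars) if ord(ch) not in font_codepoints
    )


# Hypothetical data: the codepoints a font maps, and a tiny exemplar sample.
font_codepoints = {
    ord(c)
    for c in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,"
}
english_exemplars = "abcxyz" + "ëé"

missing = missing_codepoints(font_codepoints, english_exemplars)
print(missing)  # ['é', 'ë']
```

An empty result would mean “every exemplar character has a glyph”, which is the whole test for this simple case.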

Rationale behind Shaperglot

But this approach does not work for scripts that need “shaping”, a process that maps the input Unicode codepoints of the text into a series of glyphs in a way that is not a 1:1 correspondence. For scripts like Arabic or Devanagari, it’s not enough to check if the font has default glyphs for all Unicode codepoints from some set. You also need to check if the font has some rules (features) that perform the shaping so that the final rendered text is orthographically correct.

Shaperglot lets you check the Unicode coverage, but it also allows other tests. In particular, the idea is that you shape some input text, apply certain features, and check whether the output changed.

The fact that a change happened indicates that there is some support for a language beyond just the Unicode codepoint coverage.

For example, if I put in the default i and apply the locl feature with the script tag latn and the language tag TRK, and I see that the output glyph (or series of glyphs) is different from the input, I can say with higher certainty “this font supports Turkish”.

Shaperglot will not (yet ;) ) use computer vision to judge the quality of the change, but it’s based on a very reasonable assumption: if I put in some letter and ask HarfBuzz to apply a certain feature, and the result is the same as the input, then the feature is not meaningfully implemented, hence there is a problem.
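The “did anything change?” idea can be sketched roughly like this. It is not Shaperglot's actual code. The `shape` callable is a hypothetical stand-in for a real HarfBuzz call, and `fake_shaper` is a toy that pretends to be a font with a Turkish locl rule, so that the example runs without a font file.

```python
from typing import Callable, Dict, List

# Hypothetical shaper signature: (text, features, script, language) -> glyph names.
Shaper = Callable[[str, Dict[str, bool], str, str], List[str]]


def feature_changes_output(
    shape: Shaper, text: str, feature: str, script: str, language: str
) -> bool:
    """Shape the text with the feature off and on, and report any difference."""
    before = shape(text, {feature: False}, script, language)
    after = shape(text, {feature: True}, script, language)
    return before != after


def fake_shaper(text, features, script, language):
    """Toy stand-in for a font that localizes 'i' for Turkish via locl."""
    localize = features.get("locl") and script == "latn" and language == "TRK"
    return ["i.loclTRK" if (ch == "i" and localize) else ch for ch in text]


turkish_i_changes = feature_changes_output(fake_shaper, "i", "locl", "latn", "TRK")
x_changes = feature_changes_output(fake_shaper, "x", "locl", "latn", "TRK")
print(turkish_i_changes, x_changes)  # True False
```

A real test would swap `fake_shaper` for a function that shapes with the font under test. The comparison stays the same: no difference means the feature does nothing for that input.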

The advantage of the Shaperglot approach is that the tests can be complex. Sometimes, the meaningful change will come about only through a combination of certain features, not just one feature. Or there may be alternatives: some fonts may implement something via liga, while others may implement the same thing via ccmp or calt. So the test may ask for all 3 features to be applied and check if something changed.
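As a rough illustration of the “apply all 3 features and see if something changed” test, the sketch below uses toy shapers instead of real fonts. All the names here (`changes_with_any`, `make_toy_font`, the `"xy"` input and the `x_y` glyph) are invented for the example.

```python
ALTERNATIVES = ("liga", "ccmp", "calt")


def changes_with_any(shape, text, features):
    """True if applying all the given features together alters the glyphs."""
    return shape(text, set()) != shape(text, set(features))


def make_toy_font(implementing_feature):
    """Build a toy shaper that merges "xy" only when one feature is on."""

    def shape(text, features):
        if text == "xy" and implementing_feature in features:
            return ["x_y"]
        return list(text)

    return shape


toy_fonts = {
    "via_liga": make_toy_font("liga"),
    "via_calt": make_toy_font("calt"),
    "no_rule": make_toy_font(None),  # never substitutes anything
}

results = {
    name: changes_with_any(font, "xy", ALTERNATIVES)
    for name, font in toy_fonts.items()
}
print(results)  # {'via_liga': True, 'via_calt': True, 'no_rule': False}
```

The test doesn't care which feature did the work, only that the output differs from the unshaped baseline, so fonts that take different routes to the same result both pass.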

Shaperglot has example implementations of tests for some languages, but needs more data.

In the future, additional, more sophisticated tests can be implemented. Test-driven development can help to make better fonts, and can also help to get better info about language support.

simoncozens commented 1 year ago

Thanks for this. I wrote something similar independently, and (finally) integrated it.