Better font subset detection

m4rc1e commented 2 months ago

We currently detect script subsets such as Arabic by counting the glyphs in the cmap table and checking whether more than 50% of them are in a specific script subset's .nam file. Instead, what if we simply checked whether a font fulfills a gflanguages base charset? E.g. for Arabic, we'd need all the base characters in https://github.com/googlefonts/lang/blob/main/Lib/gflanguages/data/languages/ar_Arab.textproto#L44
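To make the difference concrete, here is a minimal sketch of the two checks. It follows one reading of the description above and is not the gftools implementation: the function names, the `ARABIC_SUBSET` codepoint range and the five-letter `TOY_ARABIC_BASE` are all assumptions made for illustration, not the contents of the real .nam file or of ar_Arab.textproto.

```python
# Toy stand-ins, NOT real gftools or gflanguages data.
ARABIC_SUBSET = set(range(0x0600, 0x0700))            # hypothetical subset contents
TOY_ARABIC_BASE = "\u0627\u0628\u062A\u062B\u062C"    # five letters, not the real base charset


def share_in_subset(cmap_codepoints, subset_codepoints):
    """Fraction of the font's encoded codepoints that fall inside the subset."""
    cmap = set(cmap_codepoints)
    if not cmap:
        return 0.0
    return len(cmap & set(subset_codepoints)) / len(cmap)


def passes_threshold(cmap_codepoints, subset_codepoints, threshold=0.5):
    """Threshold check: more than `threshold` of the cmap is in the subset."""
    return share_in_subset(cmap_codepoints, subset_codepoints) > threshold


def fulfils_base(cmap_codepoints, base_chars):
    """Proposed check: every base character is encoded in the font."""
    return {ord(c) for c in base_chars} <= set(cmap_codepoints)


# A font with only three Arabic letters: all of its cmap is "Arabic",
# but it is missing two of the toy base characters.
partial_arabic = {0x0627, 0x0628, 0x062A}

# A mostly-Latin font that also encodes the whole toy base charset.
latin_plus_arabic = set(range(0x20, 0x7F)) | {ord(c) for c in TOY_ARABIC_BASE}

for name, cmap in [("partial_arabic", partial_arabic),
                   ("latin_plus_arabic", latin_plus_arabic)]:
    print(name,
          passes_threshold(cmap, ARABIC_SUBSET),
          fulfils_base(cmap, TOY_ARABIC_BASE))
```

With these toy inputs the two checks disagree in both directions: the incomplete Arabic font passes the threshold but not the base-charset check, and the mostly-Latin font does the opposite.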

simoncozens commented 2 months ago

That sounds a lot better, although we need to think it through:

A script like Arabic is used to write multiple languages - we don't have a concept of which is the "primary" language for the script at the moment, so do we require a superset of all of them?
Perhaps not, because a font which supports Arabic shouldn't be skipped just because it doesn't contain Uyghur letter ۇ, and a font can support Latin even if it doesn't have Ᵽ.
So are we back to looking at support for a certain percentage of covered characters within all languages of a script?
Or can we require the intersection of the bases of all languages for a script? Nice idea, but the intersection for Latin is the empty set: there are no characters which are an exemplar for every Latin-script language!
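The superset and intersection options can be sketched with plain sets. The three "languages" below are hypothetical placeholders with made-up base charsets, chosen only to show how the union grows too strict while the intersection collapses to nothing; they are not gflanguages data.

```python
# Hypothetical languages sharing one script; base charsets are invented.
TOY_BASES = {
    "lang_a": "abc",
    "lang_b": "bcd",
    "lang_c": "de",
}


def required_by_union(bases):
    """Superset rule: the font must contain every base char of every language."""
    return set().union(*bases.values())


def required_by_intersection(bases):
    """Intersection rule: only chars that are base chars in every language."""
    sets = [set(chars) for chars in bases.values()]
    return set.intersection(*sets) if sets else set()


font_chars = set("abcd")  # fully covers lang_a and lang_b, but not lang_c

union_req = required_by_union(TOY_BASES)
inter_req = required_by_intersection(TOY_BASES)

print(sorted(union_req), union_req <= font_chars)   # strict: fails on "e"
print(sorted(inter_req), inter_req <= font_chars)   # empty: any font passes
```

An empty requirement is satisfied by every font, so the intersection rule stops distinguishing anything once a script has enough languages.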

I can see two approaches:

1. YOLO. If a font contains any character in a subset, it gets that subset. Anything less than that and you're removing glyphs from a font, and why do we want to do that?
2. Compute the supported languages first, and then declare support for all of their scripts.
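A rough sketch of how the two approaches could differ, using invented data. The language ids, script tags and base charsets in `TOY_LANGUAGES` are assumptions for illustration, and treating "subset" and "script" as the same label is a simplification.

```python
# Hypothetical language records: script tag plus an invented base charset.
TOY_LANGUAGES = {
    "lang_a": {"script": "Latn", "base": "abc"},
    "lang_b": {"script": "Latn", "base": "abd"},
    "lang_c": {"script": "Arab", "base": "\u0627\u0628\u062A"},
}


def script_charsets(languages):
    """Union of base characters per script, standing in for a subset file."""
    charsets = {}
    for lang in languages.values():
        charsets.setdefault(lang["script"], set()).update(lang["base"])
    return charsets


def subsets_by_any_character(font_chars, languages):
    """Approach 1: one shared character is enough to claim the subset."""
    return {script for script, chars in script_charsets(languages).items()
            if font_chars & chars}


def supported_languages(font_chars, languages):
    """Languages whose whole base charset is present in the font."""
    return {lang_id for lang_id, lang in languages.items()
            if set(lang["base"]) <= font_chars}


def subsets_by_language(font_chars, languages):
    """Approach 2: claim the scripts of the fully supported languages."""
    return {languages[lang_id]["script"]
            for lang_id in supported_languages(font_chars, languages)}


# Latin font that happens to include a single Arabic letter.
font_chars = set("abc") | {"\u0627"}

print(sorted(subsets_by_any_character(font_chars, TOY_LANGUAGES)))
print(sorted(subsets_by_language(font_chars, TOY_LANGUAGES)))
```

In this toy case the first approach claims both scripts because of the single Arabic letter, while the second claims only the script of the one language the font fully supports.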

m4rc1e commented 2 months ago

> Compute the supported languages first, and then declare support for all of their scripts.

I like this a lot. gflanguages also includes population count data. Perhaps instead of counting glyphs, counting how many people you're able to cover may be better.
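As a sketch of the population idea, assuming each language record carries a script, a base charset and a population figure: the records and numbers below are invented, and the field names are hypothetical rather than the gflanguages schema.

```python
# Invented records; the populations are placeholder numbers, not real data.
TOY_LANGUAGES = {
    "lang_a": {"script": "Latn", "base": "abc", "population": 100},
    "lang_b": {"script": "Latn", "base": "abd", "population": 50},
    "lang_c": {"script": "Arab", "base": "\u0627\u0628\u062A", "population": 200},
}


def population_coverage_by_script(font_chars, languages):
    """Per script: share of the population whose language is fully supported."""
    total = {}
    covered = {}
    for lang in languages.values():
        script = lang["script"]
        total[script] = total.get(script, 0) + lang["population"]
        if set(lang["base"]) <= font_chars:
            covered[script] = covered.get(script, 0) + lang["population"]
    return {script: (covered.get(script, 0) / total[script]) if total[script] else 0.0
            for script in total}


coverage = population_coverage_by_script(set("abc"), TOY_LANGUAGES)
for script, share in sorted(coverage.items()):
    print(f"{script}: {share:.0%}")
```

A subset could then be declared when the share passes some cutoff; where that cutoff should sit is left open here, as it is in the discussion.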

googlefonts / gftools

Better font subset detection #982