graphicore / specimenTools


How to do language coverage/Google fonts char set coverage reports right. #20

Open graphicore opened 7 years ago

graphicore commented 7 years ago

@davelab6 could you please read this and tell me if I got it right.

Our goal is basically that we can take a font and

a) state which languages are supported by the font
b) tell the user which Google Fonts char sets are supported by the font
c) write next to each Google Fonts char set which languages are supported by it.
d) count the languages supported by the font and count the languages supported by each charset and get the same number for both.

Using the CLDR we can take any Unicode char set and report the language coverage for that char set. A font is a char set and a Google Fonts char set of course is one as well (a language for that matter is also a char set).
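Just to make the mechanics concrete, here is a minimal sketch (Python; the exemplar_chars mapping from language tag to required codepoints is hypothetical and would be built from the CLDR exemplar character data):

# exemplar_chars: hypothetical mapping {language tag: set of required codepoints},
# extracted from the CLDR exemplar character data.
def covered_languages(char_set, exemplar_chars):
    # A language is covered when all of its required chars are in the char set.
    return {lang for lang, required in exemplar_chars.items()
            if required <= char_set}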

b) and c) are important because the user will have to choose the char subset when embedding a font from Google Fonts. The Google Fonts API will subset the fonts using the Google Fonts char sets. So a font won't include more chars than in the Google Fonts char set. This is unless the subsetter decides a glyph with unicode outside the char set is needed, e.g. for OT-Features, but we can safely ignore this case.

Right now, what I did was to get a list of languages supported by each char set, then get the char set support of a font. If a font supports a char set (fully), all languages of the char set are supported as well. Otherwise, neither the languages nor the char set are reported as supported.

I think this logic is flawed. It should rather be like this:

A char set is, in the end, just an instruction for the subsetter: it means the font won't contain more chars than are in the char set. However, a font that contains fewer chars than the char set can still be subsetted using that char set, and it can still contain enough chars to support some or all of the languages supported by the char set. I'd like to get the list of languages supported by the font, then make the intersection of the languages supported by the char set and the languages supported by the font. If one or more languages are in the intersection, I'd like to report those languages as supported and the char set as supported; the latter only with the actually supported languages, of course.

The semantics of this would be: if you get the font using this Google Fonts char set, you'll get support for the following languages.

There are two upsides of this method:

  1. we can report more accurately which languages are supported
  2. we don't need the fonts to support all of the chars in a char set. This is important because we have some fonts that don't include certain ligatures that are in GF-Latin-Plus, e.g. "Muli" is missing:
    0xFB00 // LATIN SMALL LIGATURE FF
    0xFB03 // LATIN SMALL LIGATURE FFI
    0xFB04 // LATIN SMALL LIGATURE FFL

    None of these chars is so essential that we should discard the whole char set, and with it a huge number of supported languages. In fact, none of these chars occurs in any of the languages' char sets.
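To illustrate the proposed reporting logic, here is a rough sketch (Python, reusing the hypothetical covered_languages helper from above; charsets is an assumed mapping of Google Fonts char set name to its set of codepoints):

def charset_report(font_chars, charsets, exemplar_chars):
    # Languages the font itself supports, judged by its encoded chars.
    font_languages = covered_languages(font_chars, exemplar_chars)
    report = {}
    for name, charset_chars in charsets.items():
        charset_languages = covered_languages(charset_chars, exemplar_chars)
        # The languages you'd actually get when subsetting this font with
        # this char set: the intersection of both language sets.
        supported = font_languages & charset_languages
        if supported:
            # Report the char set only if at least one language survives,
            # listing only the actually supported languages.
            report[name] = supported
    return report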

thlinard commented 7 years ago

Perhaps a good approach is to distinguish between basic language coverage and extended coverage. For example, GF Greek Core = everyday Modern Greek, all the other GF Greek = coverage beyond basic needs, without additional languages. GF Latin Pro, GF Latin Expert and GF Cyrillic Historical are in the same situation.

davelab6 commented 7 years ago

Our goal is basically that we can take a font and

a) state which languages are supported by the font

Yes

b) tell the user which Google Fonts char sets are supported by the font

Yes

c) write next to each Google Fonts char set which languages are supported by it.

Yes

d) count the languages supported by the font and count the languages supported by each charset and get the same number for both.

Yes

Using the CLDR we can take any Unicode char set and report the language coverage for that char set.

Well, not quite; there are the unicodes, but there are also OT features + unencoded glyphs that are required for some languages. https://glyphsapp.com/tutorials/articles?q=localize

A font is a char set and a Google Fonts char set of course is one as well

Right

(a language for that matter is also a char set).

I don't think that's quite right :)

b) and c) are important because the user will have to choose the char subset when embedding a font from Google Fonts.

I don't follow you here; currently GF only has 'latin' and 'latin-ext' and the latter can include Plus, Pro, or Expert sets.

It would be good to check that latin-ext does cover everything in Plus, Pro, and Expert :)

The Google Fonts API will subset the fonts using the Google Fonts char sets.

Yes, for latin and latin-ext only

So a font won't include more chars than in the Google Fonts char set. This is unless the subsetter decides a glyph with unicode outside the char set is needed, e.g. for OT-Features, but we can safely ignore this case.

Right

Right now, what I did was to get a list of languages supported by each char set, then get the char set support of a font. If a font supports a char set (fully), all languages of the char set are supported as well. Otherwise, neither the languages nor the char set are reported as supported.

I think this logic is flawed.

I think so

It should rather be like this:

A char set is, in the end, just an instruction for the subsetter: it means the font won't contain more chars than are in the char set. However, a font that contains fewer chars than the char set can still be subsetted using that char set, and it can still contain enough chars to support some or all of the languages supported by the char set. I'd like to get the list of languages supported by the font, then make the intersection of the languages supported by the char set and the languages supported by the font. If one or more languages are in the intersection, I'd like to report those languages as supported and the char set as supported; the latter only with the actually supported languages, of course.

The semantics of this would be: if you get the font using this Google Fonts char set, you'll get support for the following languages.

OK

There are two upsides of this method:

  1. we can report more accurately which languages are supported
  2. we don't need the fonts to support all of the chars in a char set. This is important because we have some fonts that don't include certain ligatures that are in GF-Latin-Plus, e.g. "Muli" is missing:
    0xFB00 // LATIN SMALL LIGATURE FF
    0xFB03 // LATIN SMALL LIGATURE FFI
    0xFB04 // LATIN SMALL LIGATURE FFL

    None of these chars is so essential that we should discard the whole char set, and with it a huge number of supported languages. In fact, none of these chars occurs in any of the languages' char sets.

Yes, we need to distinguish between what is required for a language and what is merely recommended to have (like those ligatures)

davelab6 commented 7 years ago

Perhaps a good approach is to distinguish between basic language coverage and extended coverage

That's what I suggest distinguishing

graphicore commented 7 years ago

Well, not quite; there are the unicodes, but there are also OT features + unencoded glyphs that are required for some languages. https://glyphsapp.com/tutorials/articles?q=localize

Right. The CLDR has only Unicode chars for us. There's afaik no way to get localized substitutions from it.

I don't follow you here; currently GF only has 'latin' and 'latin-ext' and the latter can include Plus, Pro, or Expert sets.

OK. To paraphrase this: latin will be subsetted using "Latin-core" and latin-ext will be subsetted using "Latin-Plus", "Latin-Pro" or "Latin-Expert". So the subsetter will need to know which namelist to use. Or could it also be that "Latin-Plus" + "Latin-Pro" + "Latin-Expert" combined is used?

It would be good to check that latin-ext does cover everything in Plus, Pro, and Expert :)

How should this be done? As suggested before, using the union of all "Latin-{type}" namelists would kind of enforce this?

file: tools/encodings/GF Glyph Sets/latin-ext.nam

#$ include GF-latin-core_unique-glyphs.nam
#$ include GF-latin-plus_unique-glyphs.nam
#$ include GF-latin-pro_unique-glyphs.nam
#$ include GF-latin-expert_unique-glyphs.nam
#$ include GF-latin-plus_optional-glyphs.nam
#$ include GF-latin-pro_optional-glyphs.nam

This is a very explicit example; it could be done more implicitly (since expert > pro > plus > core), using the inheritance mechanism:

#$ include GF-latin-expert_unique-glyphs.nam
#$ include GF-latin-plus_optional-glyphs.nam
#$ include GF-latin-pro_optional-glyphs.nam

(The optional glyph sets would be included for possible future compatibility; they don't contain any Unicode chars yet.)
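A rough sketch of such a coverage check (Python; parse_nam is a hypothetical helper that resolves the #$ include directives of a namelist and returns its set of codepoints):

def check_latin_ext_coverage(parse_nam):
    # Union of everything the GF Latin Plus/Pro/Expert namelists define.
    union = set()
    for name in ("GF-latin-plus_unique-glyphs.nam",
                 "GF-latin-pro_unique-glyphs.nam",
                 "GF-latin-expert_unique-glyphs.nam"):
        union |= parse_nam(name)
    # Codepoints that latin-ext would fail to cover; an empty set means we're fine.
    return union - parse_nam("latin-ext.nam")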

Yes, we need to distinguish between what is required for a language and what is merely recommended to have (like those ligatures)

At the moment, "to distinguish between what is required for a language and what is merely recommended" means for us: required is the set of Unicode-encoded glyphs that are in the CLDR; recommended are all other unencoded glyphs and encoded glyphs not in the CLDR.

Those ligatures are not required to have Unicode codepoints anymore, see https://github.com/schriftgestalt/GlyphsInfo/issues/10#issuecomment-283217683. In fact, best practice seems to be to avoid giving these glyphs Unicodes at all.

Figuring out language support including required OT-substitutions is pretty hard at the moment. This is because we don't have the data to begin with, and this data can look different for different authors supporting the same languages [1].

I remember that we had a similar discussion when I was making the Arabic fonts: there's a lot encoded in Unicode for Arabic, but it's suggested to really only use the codepoints for the base glyphs. This is to kind of force people into producing simply encoded texts and into using OpenType fonts. The Unicode-encoded glyphs that we want to have are basically the chars that are entered via the keyboard.

This is also what the CLDR suggests. E.g. for Arabic (ar), the basic suggested chars (exemplar characters) are only 45 chars from 0x0621 to 0x0670 (skipping some codepoints and not including punctuation and numerals here). These come only from the "Arabic" Unicode range 0x0600–0x06FF, and the documentation explicitly excludes "presentation forms":

… It should not include presentation forms, like U+FE90 ( ‎ﺐ‎ ) ARABIC LETTER BEH FINAL FORM …

From the CLDR information we would claim that the font supports Arabic, but we don't know if it really a) has all needed glyphs, b) uses these glyphs via GSUB substitutions correctly/in the right context. This also includes explicit localization glyphs with .locl{LANG} extensions in their name; these have no codepoints either.

Also interesting, GlyphsInfo encodes the example above:

<glyph unicode="FE90" name="beh-ar.fina" category="Letter" script="arabic" production="uniFE90" altNames="behfinalarabic" description="ARABIC LETTER BEH FINAL FORM" />

But I'm pretty sure we[2] wouldn't do it for a modern Arabic font. For example, take alif-type/reem-kufi by @khaledhosny: the letter above has no Unicode in its GLIF file arBeh.fina; it's only used via GSUB and the fina feature. We can't detect this via the CLDR, and we probably shouldn't include Unicodes for such glyphs in our encodings.

Perhaps a good approach is to distinguish between basic language coverage and extended coverage

That's what I suggest distinguishing

So how to approach this approach, any ideas?

[1] I'd love to build a system that has a data-driven understanding of how the demands of a language should be implemented in fonts, and that can create and test fonts for these demands. I suggested this before. But that's not as easy as checking whether some Unicode-encoded char set correlates with some other Unicode-encoded char set.

[2] I'm personally not sure if this kind of strict "this is 2017 and we don't support legacy Unicodes" approach is really a good idea. It could create a lot of pain when there are old bodies of text. So at least having legacy-Unicode versions of such fonts could be very helpful in some cases. Supporting the modern GSUB-based style is possible with or without Unicode-encoded glyphs anyway.

graphicore commented 7 years ago

Perhaps a good approach is to distinguish between basic language coverage and extended coverage

That's what I suggest distinguishing

So how to approach this approach, any ideas?

I see, there's a suggestion in the rest of @thlinard's original comment:

For example, GF Greek Core = everyday Modern Greek, all the other GF Greek = coverage beyond basic needs, without additional languages. GF Latin Pro, GF Latin Expert and GF Cyrillic Historical are in the same situation.

So, that's kind of fine for me.


The one thing that we need to agree upon:

Is our Unicode- and CLDR-based language coverage analysis sufficient for our needs or not?

Having the Unicode chars in a font that the CLDR suggests is a strong hint, but it doesn't mean that the font actually works for that language. There's no way to figure this out by looking only at Unicode-encoded glyphs.

khaledhosny commented 7 years ago

A note about Arabic: the legacy presentation forms were probably never used to encode text and store it that way; rather, they were used to store shaping logic in fonts before smart layout technologies became the norm. So the text would still be encoded using basic Arabic characters, and the layout layer would process the text to convert it to presentation forms and show these glyphs from the font (the same effect that you can achieve today using, e.g., the Arabic shaping functionality from FriBiDi).

So the utility of having these glyphs encoded is very minimal and limited to systems that do shaping this way (they are pretty much nonexistent today), not to displaying legacy-encoded text. For some fonts (most of mine) the lack of OpenType shaping logic will completely break how the font is supposed to look, and legacy shaping will not help that much; that is why I just don't bother with it.

Back to your original question, you can use some heuristics, like: does the font have the minimal set of characters needed to support Arabic? And if so: does it also have GSUB features with Arabic language system(s), and do these features include basic shaping ones like init, medi, and fina? A weighted system along these lines would give you a good idea whether the font supports Arabic or not, though some fonts might do with less and others might need more. Similar things can be done for other scripts/languages, but you will need to gather the data and set the criteria first.
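A rough sketch of such a heuristic with fontTools might look like this (Python; ARABIC_BASE_CHARS is an assumed stand-in for the real minimal set, and the scoring is made up):

from fontTools.ttLib import TTFont

# Assumption: stand-in for the minimal Arabic character set; the real list
# would come from CLDR exemplar data or whatever criteria we settle on.
ARABIC_BASE_CHARS = set(range(0x0621, 0x064B))

def arabic_support_score(path):
    font = TTFont(path)
    cmap = font.getBestCmap()
    score = 0
    # 1. Does the font have the minimal set of characters?
    if ARABIC_BASE_CHARS <= set(cmap):
        score += 1
    if "GSUB" in font:
        gsub = font["GSUB"].table
        # 2. Is there an Arabic script (language system) in GSUB?
        scripts = {rec.ScriptTag for rec in gsub.ScriptList.ScriptRecord}
        if "arab" in scripts:
            score += 1
        # 3. Are the basic shaping features present?
        features = {rec.FeatureTag for rec in gsub.FeatureList.FeatureRecord}
        if {"init", "medi", "fina"} <= features:
            score += 1
    # Made-up threshold: a score of 3 means "probably supports Arabic".
    return score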

graphicore commented 7 years ago

@khaledhosny Thanks a lot, very interesting!

but you will need to gather the data and set the criteria first.

As you said, it still gives just "a good idea" of the language support, and it will produce wrong reports occasionally.