Split kerning by script, not by direction (second attempt)

madig commented 2 years ago

This pulls the commits for #636 and some follow-ups aside, to unblock releases from main until the code is more thoroughly tested.

Relevant issues:

https://github.com/googlefonts/ufo2ft/issues/642
https://github.com/googlefonts/ufo2ft/issues/658 (maybe)
https://github.com/googlefonts/ufo2ft/pull/635 (indirectly, because it stumbles over the same issue of what to do when not all sources have the same kerning information and you try to compile a varaible font)

Todo:

[x] Track down the issue in #658
[x] Do a bit of before-after compilation of single UFOs to compare their kerning and see if splitting by script introduces differences when it shouldn't.
[ ] Figure out what would need to be done to make variable fonts compile and behave correctly when not all kerning pairs are defined across all sources. The issue already existed before, but since we were splitting by direction, which is a much coarser property, it rarely was problematic. separate issue

Closes https://github.com/googlefonts/ufo2ft/issues/658

madig commented 2 years ago

While investigating https://github.com/googlefonts/ufo2ft/issues/658, I stumbled over the following:

The splitter stuffs foreign script kern pairs into lookups. E.g. lookup_Latn may suddenly contain both pos [Y ...] [A ...] and pos [Upsilon ...] [A ...] because the original pos @kern1.Y @kern2.A that contained their union was split up into the same lookup. I'd assume they should land in different ones?
The lookup statements are unsorted, potentially triggering a bug in https://github.com/fonttools/fonttools/blob/64fd837ca1c2669f7e0e9d41f47b376e3fa153c3/Lib/fontTools/otlLib/builder.py#L1252-L1277, where a subtable break is inserted before a value for a class pair is declared but after the classdefs are associated with the subtable already, leading to a kerning value of zero entry being inserted in the earlier subtable (because https://github.com/fonttools/fonttools/blob/64fd837ca1c2669f7e0e9d41f47b376e3fa153c3/Lib/fontTools/otlLib/builder.py#L2131 is None, None), while the next one has the correct one, but is ignored by shapers. Additionally, sorting the statements manually shaved 10 KB off the font size on my test font source. Reverting to fontTools 4.37.1 (before https://github.com/fonttools/fonttools/pull/2776) does not help.
https://github.com/fonttools/fonttools/issues/2793 says there may be bigger problems lurking?!

madig commented 2 years ago

So, in the test font provided by yanone, the feature file does not declare languagesystem grek dflt; explicitly in the features.fea file, but the kern writer uses these declarations to sort kern pairs into script buckets. Greek glyphs are then considered script-less, which is wrong, and they get dumped into random other lookups. Uh. Should the writer even do this when kerning typically touches more scripts that GSUB? I'd derive the scripts in use purely from the glyph set? @simoncozens

madig commented 2 years ago

Side-note: the code in https://github.com/googlefonts/ufo2ft/blob/58d8b2bfa0f337aa529cf21bbebffbfd61dcbfeb/Lib/ufo2ft/featureWriters/kernFeatureWriter.py#L369 sorts the kerning pairs by type, but this is later undone in https://github.com/googlefonts/ufo2ft/blob/58d8b2bfa0f337aa529cf21bbebffbfd61dcbfeb/Lib/ufo2ft/featureWriters/kernFeatureWriter.py#L501 when splitting a pair can result in different types of kerning pairs. The sorting can probably be moved down to the split kern emitter.

madig commented 2 years ago

From call:

Split kerning pairs by script into buckets
Use fontTools.misc.classifyTools within each bucket to make disjoint kern1 classes so that coverage per PairPos subtable is unique (to avoid potential fontTools otlLib subtable break bug).
1. Collect all kern1 classes from the bucket, pass them to classifyTools
2. Get back a mapping of {glyph: {disjoint set for that glyph}}
3. For each "pos kern1 kern2 value;" in the bucket:
  1. smaller_kern1s = group_by(kern1, glyph => mapping[glyph])
  2. emit ["pos smaller_kern1 kern2 value;" for each smaller_kern1 in smaller_kern1s]
Sort the whole bucket to have:
- all the enum pos and glyph-to-glyph pos first, then the class-to-class pos second;
- the "pos small_kern1" with the same kern1 next to each other

madig commented 2 years ago

Also, the PairPos coverage must be unique inside a lookup, otherwise things are unreachable. That's an invariant to follow.

madig commented 2 years ago

So, regarding https://github.com/googlefonts/ufo2ft/pull/667#issuecomment-1281381467, we either:

Tell the designer to add a Greek languagesystem, or
Insert the lookup under DFLT (if exists, otherwise error?), or
Just use the script in the lookup even when the languagesystem is not declared.

Maybe I'll try the latter and see what happens in various shaping engines.

anthrotype commented 2 years ago

the feature file does not declare languagesystem grek dflt; explicitly in the features.fea file, but the kern writer uses these declarations to sort kern pairs into script buckets. Greek glyphs are then considered script-less, which is wrong, and they get dumped into random other lookups.

This thing of using the globally declared default languagesystems pre-dates the split-kerning-by-scripts PR, but maybe doesn't make sense any more and should be revised.

In the previous code in which we'd only split lookups between LTR/RTL, the default languagesystems were parsed and grouped into buckets depending on whether they were associated with "kern" or "dist", and on the script's horizontal writing direction (LTR/RTL), stored in self.context.scriptGroups dict of dicts (see _groupScriptsByTagAndDirection)

Then, if there was no globally declared script associated with "dist" feature, none would be generated.

Then, kerning pairs would be split into three lookups: LTR, RTL and a kern_dflt for pairs whose glyphs' script/bidi properties don't fit either LTR or RTL (e.g. punctuation, numbers, etc.). Finally, when deciding under which script/language/feature (kern vs dist) to register each of these lookups, again these script groupings derived from the global default languagesystems would be used (e.g. see _registerKernLookups). E.g. if kerning.plist contains pairs between LTR glyphs but default languagesystems only declares RTL scripts and no LTR ones, then the kern_ltr lookup would be registered under DFLT/dflt.

What we could have done maybe was to register such kern_ltr, not under DFLT/dflt, but under all the LTR scripts that are actually used therein e.g. by adding script grek; and such statements, even if they are not globally declared in the default languagesystems (which we know it's ok). I'm honestly not sure and can't remember why we didn't actually do that... Maybe we didn't feel like we could safely infer stuff and preferred to leave it to the font developer to decide about. Or maybe it was just simpler and OK enough that way.

Even so, imposing the font developer to write explicit default languagesystems to control the behavior of the kern feature writer may not always be desirable, as those apply indiscriminately to both GPOS and GSUB. In the example brought forward by Nikolaus, the GSUB features work fine without Greek being declared among the default languagesystems, but GPOS kerning may need those otherwise ufo2ft places greek kerning under DFLT/dflt; but if greek were added to default languagesystems then the meaning of all the pos/sub rules elsewhere that are un-marked by explicit script/language statements would suddenly also be registered for Greek too - though i guess they'd be no-op since no greek glyphs are probably used in them...

I'm warming up to the idea that for the split-by-script kern writer, we should just ignore the global languagesystem and use whatever scripts that we find among the kerned glyphs.

Thoughts @simoncozens @khaledhosny @madig ?

simoncozens commented 2 years ago

That sounds good to me, the only issue being that we still need to look at the languagesystem to be aware of what shaper the font goes through, so that we can put kerns in kern or dist appropriately. So we still end up looking at the languagesystems...

anthrotype commented 2 years ago

so that we can put kerns in kern or dist appropriately

once we know what scripts are used in kerning pairs, and have split these accordingly, we also know whether there are dist-enabled scripts or not

madig commented 2 years ago

It seems that the new kern writer should be reworked a bit to adjust for our new assumptions. I'll try and draw up an algorithm to better wrap my mind around it.

Also, blindly using unicodedata.script_extension instead of https://github.com/googlefonts/ufo2ft/blob/58d8b2bfa0f337aa529cf21bbebffbfd61dcbfeb/Lib/ufo2ft/featureWriters/kernFeatureWriter.py#L445-L454 gives me additional lookup kern_Thaa and lookup kern_Yezi blocks for Arabic numerals :) I suppose they don't hurt, but are they useful?

madig commented 2 years ago

@anthrotype what's the purpose of dealing with bases and marks in the kern writer?

anthrotype commented 2 years ago

I am not sure we want to use script_extension.. that may add unwanted/unccessary script records. How about we extend the global default languagesystems with the set of scripts (not script_extensions!) actually used by the kerning pairs. You make a single kern_Arab lookup and also register the same under Thana or Yezi or whatever "extension" scripts but only iff those are defined in the global languagesystems..

anthrotype commented 2 years ago

aargh this is getting too complicated now, too much guessing. Maybe using the default languagesystems as we were/are is not so bad after all, at least it's explicit and leaves the responsibility to the font developer. 🤔

madig commented 2 years ago

I think we should follow the guessing/inference approach until we hit an actual roadblock. I'll look into using the script rather than script_extension.

Also, I found another problem: enum pos em-cy [oslash.BRACKET.150 oslashacute.BRACKET.150] -10;. I think ufo2ft does not currently access Designspace rules for script discovery and the rules are tacked onto the binary not through features, but by writing into the final binary later in varLib. This particular line was in the feature output of a single UFO, so I think there's little we can do there. We probably need to pass on the info somehow at least when compiling from a Designspace, to avoid merging weirdness later?

madig commented 2 years ago

Note to self: look at trying to preserve the kerning pair type (glyph to glyph, class to class, ...) when splitting pairs, by emitting eg pos [a] [b c] instead of enum pos a [b c].

madig commented 2 years ago

You make a single kern_Arab lookup and also register the same under Thana or Yezi or whatever "extension" scripts

Maybe one could also classifyTools these specific multi-script pairs, group them into their own lookup and insert references in feature kern for each script 🤔

madig commented 2 years ago

Hm, if some codepoints are used in multiple scripts (e.g. Arabic numerals in Arabic and Thaana), is just storing them for Arabic gonna work? Are shapers going to run thaa and arab scripts? I think I'll keep the script_extensions for now and figure something out later.

madig commented 2 years ago

More thinking... since the BRACKET glyphs aren't reachable in the UFO, should the kern writer actually do a reachability analysis (look for Unicoded glyphs plus scour GSUB; additionally scan Designspace rules) and ignore glyphs that aren't reachable?

anthrotype commented 2 years ago

feature writers work on ufos, they don't know about designspace..

madig commented 2 years ago

Hm. So, how do we pass in higher level information? Add new attribute additionalGlyphSubstitutions: dict[GlyphName, GlyphName to BaseFeatureCompiler, mirroring Designspace rules? That may not be enough though, because feature writers preceeding the kern writer could add substitutions and thereby influence script discovery. Or, well, such modifications work on the feature file, so compileGSUB would include them. So this may work?

anthrotype commented 2 years ago

it might work but feels iffy

anthrotype commented 2 years ago

i'm thinking we should provide a way for users to override ufo2ft's inference of unicode properties such as script/bidi category based on cmap+GSUB for specific glyphs, we might need that anyway to support Glyphs.app's RTL kerning https://github.com/googlefonts/ufo2ft/issues/658#issuecomment-1271437448

madig commented 2 years ago

So, what do we want to be able to pass in? A glyphUnicodeMapping: dict[str, int] | None where the latter would trigger inference and the former is considered complete?

madig commented 2 years ago

Access to the Designspace is solved with a new VariableFeatureCompiler in https://github.com/googlefonts/ufo2ft/pull/635/files#diff-7d78e1bc9030aa197442bd5c7eca11381272a8c20f3467bbd7bea27d12bc17eeR355-R393. Having a separate glyphUnicodeMapping might still be useful.

anthrotype commented 2 years ago

Having a separate glyphUnicodeMapping might still be useful.

ok let's do that in a different pr though

madig commented 2 years ago

I believe my WIP does what we need, all class-to-class pairings now seem to have a disjoint kern1 classes I believe, from looking at output for yanone's private real world font right here.

I did spot two things though in an excerpt, one weird (two kern2 classes for the same kern1 parent that were almost identical except for one additional glyph???) and the other a possible size optimization opportunity. Consider the following class to class pairings:

[A B C] [a b] -10
[A B C] [c d] -10
[A B C] [e f] -20

can we collapse them into the following?

[A B C] [a b c d] -10
[A B C] [e f] -20

Or is this an optimization that only the otlLib subtable builder could do because kern2s have to be disjoint per subtable?

madig commented 2 years ago

Next steps:

Make sure class-to-class pairs stay that way instead of turning into glyph-to-class.
Deduplicate lookups (e.g. the test font generates Arabic numeral kerning for Arab, Thaa and Yezi; these can be put into their own lookup and just registered for these scripts instead of duplicating into 3 identical lookups)
Hook up test code to actual fea writing, take care of kern vs. dist
Do we need all that info on KerningPair?
RTL kerning pitfalls because the advance has to be the other way around? Special casing of RTL against punctuation (Arabic numbers against Arabic letters or punctuation)? Shapers treat this differently, can we help them somehow? Drop kerning between different bidi types????
Simplify code

madig commented 2 years ago

Random thought: if we split kerning groups into per-script subgroups, and just combine matching sides (plus the Common group) and put them into their script buckets, do we need to use classify as suggested in https://github.com/fonttools/fonttools/issues/2793? Given the same glyph to scripts mapping, groups are going to be split deterministically in the same way and they are guaranteed to have unique glyph membership per side (unless you turn off validation?)?

madig commented 2 years ago

I wrote https://github.com/daltonmaag/kerning-validator to help me ensure that the source kerning is actually applied by HarfBuzz. Works on UFOs only for now. @yanone

madig commented 2 years ago

Using the kerning validator, I managed to validate all of Noto (that I could compile) and a bunch of random UFOs I had lying around. The only exception are some zero kerns that should be something else in Noto Sans Tamil Regular and Light, but they are the same with mainline ufo2ft (I suspect some sort of source issue, but haven't looked into it). I ignored all warnings where the sources had floating point kern values, because of course those aren't going to match.

I am confident now that this splitter does the right thing.

googlefonts / ufo2ft

Split kerning by script, not by direction (second attempt) #667