While investigating https://github.com/googlefonts/ufo2ft/issues/658, I stumbled over the following:
lookup_Latn may suddenly contain both pos [Y ...] [A ...] and pos [Upsilon ...] [A ...], because the original pos @kern1.Y @kern2.A that contained their union was split up into the same lookup. I'd assume they should land in different ones? The first of the two has (None, None), while the next one has the correct one, but is ignored by shapers. Additionally, sorting the statements manually shaved 10 KB off the font size on my test font source. Reverting to fontTools 4.37.1 (before https://github.com/fonttools/fonttools/pull/2776) does not help.
So, in the test font provided by yanone, the feature file does not declare languagesystem grek dflt; explicitly in the features.fea file, but the kern writer uses these declarations to sort kern pairs into script buckets. Greek glyphs are then considered script-less, which is wrong, and they get dumped into random other lookups. Uh. Should the writer even do this, when kerning typically touches more scripts than GSUB? I'd derive the scripts in use purely from the glyph set? @simoncozens
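Something like this, roughly (a minimal sketch assuming a ufoLib2/defcon-style font object; this is not the actual ufo2ft implementation):

```python
from fontTools import unicodedata

def scripts_in_glyph_set(ufo):
    """Collect Unicode script codes for all encoded glyphs in the UFO."""
    scripts = set()
    for glyph in ufo:  # assumes ufoLib2/defcon-style iteration over glyphs
        for codepoint in glyph.unicodes:
            script = unicodedata.script(chr(codepoint))
            # Common/Inherited/Unknown don't identify a concrete script.
            if script not in ("Zyyy", "Zinh", "Zzzz"):
                scripts.add(script)
    return scripts
```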
Side-note: the code in https://github.com/googlefonts/ufo2ft/blob/58d8b2bfa0f337aa529cf21bbebffbfd61dcbfeb/Lib/ufo2ft/featureWriters/kernFeatureWriter.py#L369 sorts the kerning pairs by type, but this is later undone in https://github.com/googlefonts/ufo2ft/blob/58d8b2bfa0f337aa529cf21bbebffbfd61dcbfeb/Lib/ufo2ft/featureWriters/kernFeatureWriter.py#L501, because splitting a pair can result in different types of kerning pairs. The sorting can probably be moved down to the split-kern emitter.
From a call: use fontTools.misc.classifyTools within each bucket to make disjoint kern1 classes, so that the coverage per PairPos subtable is unique (to avoid a potential fontTools otlLib subtable break bug).
Also, the PairPos coverage must be unique inside a lookup, otherwise rules are unreachable. That's an invariant to follow.
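A rough sketch of that idea (not the actual ufo2ft code), using fontTools.misc.classifyTools.classify to turn possibly-overlapping kern1 classes into disjoint ones; the glyph names are made up:

```python
from fontTools.misc.classifyTools import classify

# Two kern1 classes within the same script bucket that overlap in "Y".
kern1_classes = [
    {"Y", "Yacute", "Ydieresis"},
    {"Y", "Upsilon"},
]

disjoint_classes, glyph_to_class = classify(kern1_classes)
# disjoint_classes is a list of non-overlapping sets, e.g.
# [{"Yacute", "Ydieresis"}, {"Upsilon"}, {"Y"}] (order may differ);
# glyph_to_class maps each glyph to the disjoint class it belongs to, so the
# original pairs can be re-expressed in terms of the new classes and each
# PairPos subtable gets a unique first-glyph coverage.
```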
So, regarding https://github.com/googlefonts/ufo2ft/pull/667#issuecomment-1281381467, we either:
Maybe I'll try the latter and see what happens in various shaping engines.
the feature file does not declare languagesystem grek dflt; explicitly in the features.fea file, but the kern writer uses these declarations to sort kern pairs into script buckets. Greek glyphs are then considered script-less, which is wrong, and they get dumped into random other lookups.
This thing of using the globally declared default languagesystems pre-dates the split-kerning-by-scripts PR, but maybe doesn't make sense any more and should be revised.
In the previous code, in which we'd only split lookups between LTR/RTL, the default languagesystems were parsed and grouped into buckets depending on whether they were associated with "kern" or "dist", and on the script's horizontal writing direction (LTR/RTL), stored in the self.context.scriptGroups dict of dicts (see _groupScriptsByTagAndDirection).
Then, if there was no globally declared script associated with the "dist" feature, none would be generated.
Then, kerning pairs would be split into three lookups: LTR, RTL and a kern_dflt for pairs whose glyphs' script/bidi properties fit neither LTR nor RTL (e.g. punctuation, numbers, etc.). Finally, when deciding under which script/language/feature (kern vs dist) to register each of these lookups, these script groupings derived from the global default languagesystems would again be used (e.g. see _registerKernLookups). E.g. if kerning.plist contains pairs between LTR glyphs but the default languagesystems declare only RTL scripts and no LTR ones, then the kern_ltr lookup would be registered under DFLT/dflt.
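Roughly, the old grouping did something like the following (a simplified sketch, not the exact ufo2ft code; the DIST_ENABLED_SCRIPTS set here is just an illustrative subset):

```python
from fontTools import unicodedata

DIST_ENABLED_SCRIPTS = {"Knda", "Mlym", "Taml", "Telu"}  # illustrative subset

def group_scripts_by_tag_and_direction(script_tags):
    """Bucket OpenType script tags by feature (kern vs dist) and direction."""
    groups = {}  # {"kern" | "dist": {"LTR" | "RTL": {script codes}}}
    for tag in script_tags:  # e.g. ["latn", "arab", "knda"] from languagesystems
        script = unicodedata.ot_tag_to_script(tag)
        if script is None:  # e.g. "DFLT"
            continue
        feature = "dist" if script in DIST_ENABLED_SCRIPTS else "kern"
        direction = unicodedata.script_horizontal_direction(script, "LTR")
        groups.setdefault(feature, {}).setdefault(direction, set()).add(script)
    return groups
```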
What we could have done, maybe, was to register such a kern_ltr lookup not under DFLT/dflt, but under all the LTR scripts that are actually used therein, e.g. by adding script grek; and similar statements, even if they are not globally declared in the default languagesystems (which we know is OK). I'm honestly not sure and can't remember why we didn't actually do that...
Maybe we didn't feel like we could safely infer stuff and preferred to leave it to the font developer to decide.
Or maybe it was just simpler and OK enough that way.
Even so, requiring the font developer to write explicit default languagesystems to control the behavior of the kern feature writer may not always be desirable, as those apply indiscriminately to both GPOS and GSUB. In the example brought forward by Nikolaus, the GSUB features work fine without Greek being declared among the default languagesystems, but GPOS kerning may need it, otherwise ufo2ft places Greek kerning under DFLT/dflt. But if Greek were added to the default languagesystems, then all the pos/sub rules elsewhere that are unmarked by explicit script/language statements would suddenly also be registered for Greek - though I guess they'd be no-ops, since no Greek glyphs are probably used in them...
I'm warming up to the idea that for the split-by-script kern writer, we should just ignore the global languagesystem and use whatever scripts we find among the kerned glyphs.
Thoughts @simoncozens @khaledhosny @madig ?
That sounds good to me, the only issue being that we still need to look at the languagesystems to be aware of what shaper the font goes through, so that we can put kerns in kern or dist appropriately. So we still end up looking at the languagesystems...
so that we can put kerns in kern or dist appropriately
Once we know what scripts are used in kerning pairs, and have split these accordingly, we also know whether there are dist-enabled scripts or not.
It seems that the new kern writer should be reworked a bit to adjust for our new assumptions. I'll try and draw up an algorithm to better wrap my mind around it.
Also, blindly using unicodedata.script_extension instead of https://github.com/googlefonts/ufo2ft/blob/58d8b2bfa0f337aa529cf21bbebffbfd61dcbfeb/Lib/ufo2ft/featureWriters/kernFeatureWriter.py#L445-L454 gives me additional lookup kern_Thaa and lookup kern_Yezi blocks for Arabic numerals :) I suppose they don't hurt, but are they useful?
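For context, this is the difference in question (the exact Script_Extensions contents depend on the Unicode version bundled with fontTools):

```python
from fontTools import unicodedata

digit_four = "\u0664"  # ARABIC-INDIC DIGIT FOUR
print(unicodedata.script(digit_four))
# "Arab"
print(unicodedata.script_extension(digit_four))
# e.g. {"Arab", "Thaa", "Yezi"} -- hence the extra kern_Thaa/kern_Yezi lookups
```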
@anthrotype what's the purpose of dealing with bases and marks in the kern writer?
I am not sure we want to use script_extension... that may add unwanted/unnecessary script records. How about we extend the global default languagesystems with the set of scripts (not script_extensions!) actually used by the kerning pairs? You make a single kern_Arab lookup and also register the same under Thaa or Yezi or whatever "extension" scripts, but only if those are defined in the global languagesystems.
Aargh, this is getting too complicated now, too much guessing. Maybe using the default languagesystems as we were/are is not so bad after all; at least it's explicit and leaves the responsibility to the font developer. 🤔
I think we should follow the guessing/inference approach until we hit an actual roadblock. I'll look into using the script rather than script_extension.
Also, I found another problem: enum pos em-cy [oslash.BRACKET.150 oslashacute.BRACKET.150] -10;. I think ufo2ft does not currently consult Designspace rules for script discovery, and the rules are tacked onto the binary not through features, but by writing into the final binary later in varLib. This particular line was in the feature output of a single UFO, so I think there's little we can do there. We probably need to pass on the info somehow, at least when compiling from a Designspace, to avoid merging weirdness later?
Note to self: look at trying to preserve the kerning pair type (glyph to glyph, class to class, ...) when splitting pairs, by emitting e.g. pos [a] [b c] instead of enum pos a [b c].
You make a single kern_Arab lookup and also register the same under Thana or Yezi or whatever "extension" scripts
Maybe one could also classifyTools these specific multi-script pairs, group them into their own lookup and insert references in feature kern for each script 🤔
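Something like this, maybe (a hedged sketch using fontTools.feaLib.ast; the lookup name and script tags are illustrative, and the pos statements for the multi-script pairs are elided):

```python
from fontTools.feaLib import ast

shared = ast.LookupBlock("kern_arab_numerals")
# ... append the pos statements for the multi-script pairs to shared.statements ...

kern = ast.FeatureBlock("kern")
for script_tag in ("arab", "thaa", "yezi"):
    # Reference the same shared lookup under each relevant script record.
    kern.statements.append(ast.ScriptStatement(script_tag))
    kern.statements.append(ast.LookupReferenceStatement(shared))

doc = ast.FeatureFile()
doc.statements.extend([shared, kern])
print(doc.asFea())
```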
Hm, if some codepoints are used in multiple scripts (e.g. Arabic numerals in Arabic and Thaana), is just storing them for Arabic gonna work? Are shapers going to run the thaa and arab scripts? I think I'll keep the script_extensions for now and figure something out later.
More thinking... since the BRACKET glyphs aren't reachable in the UFO, should the kern writer actually do a reachability analysis (look for glyphs with Unicode values, plus scour GSUB; additionally scan Designspace rules) and ignore glyphs that aren't reachable?
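A minimal sketch of such a reachability pass, assuming the caller has already gathered a substitution map from the GSUB closure and the Designspace rules (the helper and its inputs are hypothetical, not existing ufo2ft API):

```python
def reachable_glyphs(ufo, substitution_map):
    """substitution_map: dict mapping a glyph name to the set of glyph names
    it can be substituted by (from GSUB and/or Designspace rules)."""
    # Start from glyphs that are directly encoded (have Unicode codepoints).
    reachable = {glyph.name for glyph in ufo if glyph.unicodes}
    # Iterate until no new glyphs become reachable through substitutions,
    # e.g. oslash -> oslash.BRACKET.150 from a Designspace rule.
    changed = True
    while changed:
        changed = False
        for source, targets in substitution_map.items():
            if source in reachable and not targets <= reachable:
                reachable |= targets
                changed = True
    return reachable
```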
feature writers work on ufos, they don't know about designspace..
Hm. So, how do we pass in higher-level information? Add a new attribute additionalGlyphSubstitutions: dict[GlyphName, GlyphName] to BaseFeatureCompiler, mirroring Designspace rules? That may not be enough though, because feature writers preceding the kern writer could add substitutions and thereby influence script discovery. Or, well, such modifications work on the feature file, so compileGSUB would include them. So this may work?
it might work but feels iffy
I'm thinking we should provide a way for users to override ufo2ft's inference of Unicode properties such as script/bidi category (based on cmap+GSUB) for specific glyphs; we might need that anyway to support Glyphs.app's RTL kerning https://github.com/googlefonts/ufo2ft/issues/658#issuecomment-1271437448
So, what do we want to be able to pass in? A glyphUnicodeMapping: dict[str, int] | None, where None would trigger inference and a provided mapping is considered complete?
Access to the Designspace is solved with a new VariableFeatureCompiler in https://github.com/googlefonts/ufo2ft/pull/635/files#diff-7d78e1bc9030aa197442bd5c7eca11381272a8c20f3467bbd7bea27d12bc17eeR355-R393. Having a separate glyphUnicodeMapping might still be useful.
Having a separate glyphUnicodeMapping might still be useful.
ok let's do that in a different pr though
I believe my WIP does what we need: all class-to-class pairings now seem to have disjoint kern1 classes, from looking at the output for yanone's private real-world font right here.
I did spot two things in an excerpt, though: one weird (two kern2 classes for the same kern1 parent that were almost identical except for one additional glyph???) and the other a possible size-optimization opportunity. Consider the following class-to-class pairings:
can we collapse them into the following?
Or is this an optimization that only the otlLib subtable builder could do because kern2s have to be disjoint per subtable?
Next steps:
Random thought: if we split kerning groups into per-script subgroups, and just combine matching sides (plus the Common group) and put them into their script buckets, do we need to use classify as suggested in https://github.com/fonttools/fonttools/issues/2793? Given the same glyph-to-scripts mapping, groups are going to be split deterministically in the same way, and they are guaranteed to have unique glyph membership per side (unless you turn off validation?). See the sketch below.
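For illustration, splitting a kerning group per script given a precomputed glyph-to-scripts mapping is deterministic (a sketch with made-up names, not the PR's code):

```python
def split_group_by_script(group_glyphs, glyph_scripts):
    """glyph_scripts: dict mapping glyph name -> set of script codes."""
    subgroups = {}  # script code -> glyphs, with "Zyyy" as the Common bucket
    for glyph in group_glyphs:
        for script in glyph_scripts.get(glyph, {"Zyyy"}):
            subgroups.setdefault(script, []).append(glyph)
    return subgroups

# E.g. ["O", "Ocyrillic", "Omicron", "zero"] splits into
# {"Latn": ["O"], "Cyrl": ["Ocyrillic"], "Grek": ["Omicron"], "Zyyy": ["zero"]}
# on every run given the same glyph_scripts mapping, which is why no further
# classify step should be needed for the group definitions themselves.
```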
I wrote https://github.com/daltonmaag/kerning-validator to help me ensure that the source kerning is actually applied by HarfBuzz. Works on UFOs only for now. @yanone
Using the kerning validator, I managed to validate all of Noto (that I could compile) and a bunch of random UFOs I had lying around. The only exceptions are some zero kerns that should be something else in Noto Sans Tamil Regular and Light, but they are the same with mainline ufo2ft (I suspect some sort of source issue, but haven't looked into it). I ignored all warnings where the sources had floating-point kern values, because of course those aren't going to match.
I am confident now that this splitter does the right thing.
This pulls the commits for #636 and some follow-ups aside, to unblock releases from main until the code is more thoroughly tested.
Relevant issues:
Todo:
Figure out what would need to be done to make variable fonts compile and behave correctly when not all kerning pairs are defined across all sources. The issue already existed before, but since we were splitting by direction, which is a much coarser property, it rarely was problematic. (Moved to a separate issue.)
Closes https://github.com/googlefonts/ufo2ft/issues/658