googlefonts / oxidize

Notes on moving tools and libraries to Rust.
Apache License 2.0
173 stars 7 forks source link

notes on subsetting & packing vs repacking #32

Open cmyr opened 2 years ago

cmyr commented 2 years ago

I have a few thoughts related to these topics that may not be worth a full write-up, but which I would like to at least record in point form, while they're fresh.

Ultimately I'm not sure how relevant these points are to any specific design decisions, but I think they're worth keeping in mind when working further on this problem.

rsheeter commented 2 years ago

@garretrieger are these valid assertions? Perhaps especially, is it always true that post-subsetting has a trivial solution and every table will shrink?

Lorp commented 2 years ago

A simple counterexample would be when subsetting breaks a contiguous run of Unicode ids. The cmap table of a font containing ABCDEFGHIJKLMNOPQRSTUVWXYZ grows when the font is subset to ACEGIKMOQSUWY.

cmyr commented 2 years ago

@Lorp great example. I suspect in practice that size difference will rarely render a valid packing invalid, but it does definitely introduce more doubt.

garretrieger commented 2 years ago

Re: possible size increase after subsetting: this can be hard to predict, while in general yes the subsetted version of a table should be smaller there are definitely cases where it's not. For example if your subset manages to intersect most/all of the rules in a GSUB table (not hard in some fonts which uses rules which apply to classes of characters) then there's a chance you'll end up with a larger end product as the coverage tables grow due to ranges being broken up. However, to date I haven't seen a case with harfbuzz subsetting where our repacker isn't able to repack a subset even though we don't do extension promotion and table splitting. So the assertions here are likely correct for pretty much all typical subsetting cases.

FYI I just added an extension promotion mechanism into the hb repacker, though it's currently only enabled for font authoring cases and skipped for subsetting. It's pretty efficient and should be fast enough for use in subsetting in the future: https://github.com/harfbuzz/harfbuzz/pull/3745. I use a simple greedy algorithm to decide which lookups to leave as non-ext and then promote the rest to extension. A more complete implementation algorithm would use a dynamic programming backpack problem like solution, but I suspect that's not needed in practice. Once I get connected component analysis and demotion added to the extension promotion implementation I'd like to turn it on for subsetting as well as it would allow us to to re-optimize the set of extension lookups to further reduce file size during subsetting. For example if things shrink enough we can possibly remove some or all extensions saving 8 bytes per subtable.

Stepping back a bit, over time I've moved my stance on repacking away from using small iterative interventions (ie. node duplication and node reordering via priority changes) to prefering the use of broad strokes mechanisms like extension promotion, space assignment, and table splitting. These broad strokes mechanisms allow us to make changes to the graph that have pretty strong guarantees for improving the packability of the graph. I believe the general case of packing GSUB/GPOS (including in font authoring) can be fully solved by the smart application of only extension promotion, table splitting, space assignment, and smart topological sort. If those are implemented well then there should pretty much be no need for application of iterative fixes (duplication, and manual reordering of nodes):

garretrieger commented 2 years ago

It should be noted that the above discussion applies specifically to GSUB/GPOS, though I believe those are the only two tables that we can't reliably pack with just a smart topological sort. COLRv1 and others are not typically problematic (thanks to the use of 24bit offsets).

garretrieger commented 2 years ago

Direct link to the implementation of extension promotion algorithm: https://github.com/harfbuzz/harfbuzz/pull/3745/files#diff-b9d04c6bdc2f2b780c733fb97d5318cb3ec902dd0e2835982bdfa04c5e0ae5faR74

garretrieger commented 2 years ago

Another update, I've now added table splitting starting with PairPosFormat1+2: https://github.com/harfbuzz/harfbuzz/pull/3779