notes on subsetting & packing vs repacking

cmyr commented 2 years ago

I have a few thoughts related to these topics that may not be worth a full write-up, but which I would like to at least record in point form, while they're fresh.

packing versus repacking: these two tasks are different in important ways; the biggest, for me, is that in the case of repacking, we know that there is an existing solution to the graph, and assuming we have not done anything to massively increase the size of any tables, this solution should be quite 'shallow in the search space', that is it should be easy to find. When packing from scratch, there is no guarantee that a solution exists, and the packer may need to execute more complex strategies (such as table splitting and extension promotion) that should not be necessary when repacking.
repacking after subsetting should have a trivial solution: Assume that, given a font F and a subset of that font F', each table in F' should be equal or lesser in size than the corresponding table in F. Given this, we can pack F' simply by reusing the original table ordering of F. This requires us to avoid things like deduplication, which means we may end up with a sub-optimal solution in certain cases, but should generally reflect the the quality of the original packing job.
repacking for webfont delivery has clearly different tradeoffs versus packing for general distribution: When compiling a font from source, I believe it would generally be desirable to increase compile time in order to produce more efficient binaries. When repacking a font in response to an http request, this is less true. A table packer that is ideally suited to compilation may not be suited to repacking for webfont delivery.

Ultimately I'm not sure how relevant these points are to any specific design decisions, but I think they're worth keeping in mind when working further on this problem.

rsheeter commented 2 years ago

@garretrieger are these valid assertions? Perhaps especially, is it always true that post-subsetting has a trivial solution and every table will shrink?

Lorp commented 2 years ago

A simple counterexample would be when subsetting breaks a contiguous run of Unicode ids. The cmap table of a font containing ABCDEFGHIJKLMNOPQRSTUVWXYZ grows when the font is subset to ACEGIKMOQSUWY.

cmyr commented 2 years ago

@Lorp great example. I suspect in practice that size difference will rarely render a valid packing invalid, but it does definitely introduce more doubt.

garretrieger commented 2 years ago

Re: possible size increase after subsetting: this can be hard to predict, while in general yes the subsetted version of a table should be smaller there are definitely cases where it's not. For example if your subset manages to intersect most/all of the rules in a GSUB table (not hard in some fonts which uses rules which apply to classes of characters) then there's a chance you'll end up with a larger end product as the coverage tables grow due to ranges being broken up. However, to date I haven't seen a case with harfbuzz subsetting where our repacker isn't able to repack a subset even though we don't do extension promotion and table splitting. So the assertions here are likely correct for pretty much all typical subsetting cases.

FYI I just added an extension promotion mechanism into the hb repacker, though it's currently only enabled for font authoring cases and skipped for subsetting. It's pretty efficient and should be fast enough for use in subsetting in the future: https://github.com/harfbuzz/harfbuzz/pull/3745. I use a simple greedy algorithm to decide which lookups to leave as non-ext and then promote the rest to extension. A more complete implementation algorithm would use a dynamic programming backpack problem like solution, but I suspect that's not needed in practice. Once I get connected component analysis and demotion added to the extension promotion implementation I'd like to turn it on for subsetting as well as it would allow us to to re-optimize the set of extension lookups to further reduce file size during subsetting. For example if things shrink enough we can possibly remove some or all extensions saving 8 bytes per subtable.

Stepping back a bit, over time I've moved my stance on repacking away from using small iterative interventions (ie. node duplication and node reordering via priority changes) to prefering the use of broad strokes mechanisms like extension promotion, space assignment, and table splitting. These broad strokes mechanisms allow us to make changes to the graph that have pretty strong guarantees for improving the packability of the graph. I believe the general case of packing GSUB/GPOS (including in font authoring) can be fully solved by the smart application of only extension promotion, table splitting, space assignment, and smart topological sort. If those are implemented well then there should pretty much be no need for application of iterative fixes (duplication, and manual reordering of nodes):

In the worst case space assignment + extension promotion can cause each subtable to be packed fully independently from each other sub table. This reduces the packing case to figuring out a way to pack each subtable on it's own.
Then if the subtable is too large to be packed, table splitting will allow it to be broken into 2 or more subtables, where again space assignment can isolate them if needed. In practice we should be able to repeatedly split a subtable until each new subtable is guaranteed to be packable.

Extension promotion and table splitting should be able to be done in a single pass up front in a way that produces a graph that's guaranteed to be packable. This can be done by looking at the sizes of subgraphs and ensure they are always less than 64kb. Basically you just choose extensions and table splits until space 0 and each ExtSubTable's have a subgraph that's less than 64kb. See: https://github.com/harfbuzz/harfbuzz/pull/3745 for how I'm doing that for space 0 in the extension promotion case. From there you initially allow maximal table sharing and then iteratively shuffle subtables into new spaces as needed until you get something that packs (as we currently do today).

garretrieger commented 2 years ago

It should be noted that the above discussion applies specifically to GSUB/GPOS, though I believe those are the only two tables that we can't reliably pack with just a smart topological sort. COLRv1 and others are not typically problematic (thanks to the use of 24bit offsets).

garretrieger commented 2 years ago

Direct link to the implementation of extension promotion algorithm: https://github.com/harfbuzz/harfbuzz/pull/3745/files#diff-b9d04c6bdc2f2b780c733fb97d5318cb3ec902dd0e2835982bdfa04c5e0ae5faR74

garretrieger commented 2 years ago

Another update, I've now added table splitting starting with PairPosFormat1+2: https://github.com/harfbuzz/harfbuzz/pull/3779

googlefonts / oxidize

notes on subsetting & packing vs repacking #32