ParseTTC duplicates work for tables shared between fonts

dominikh commented 6 months ago

In an OpenType font collection, some tables might be referenced by multiple fonts. For example, in the Noto Sans CJK font collection, all fonts refer to the same CFF2 table (and several others, but the CFF2 table is by far the largest). However, ParseTTC treats each font as an individual object, loading and parsing the same tables repeatedly. For Noto Sans CJK, this results in a 5x increase in I/O and processing time, loading the 30 MB CFF2 table five times, once per font.

I'm not sure that the ParseTTC API is a good idea in the first place (we may only ever want one font from the collection), but if it is to exist, it should at least exploit data deduplication.

benoitkugler commented 6 months ago

Huh..

We could do it, but it would involve a new API for loading collections, since we would need to track the shared tables. And the NewFont constructor would have to be adapted quite heavily..

whereswaldon commented 6 months ago

It seems worth doing given the potential savings. We could (potentially) still offer the simpler, less performant API for use cases that don't need the extra complexity.

dominikh commented 6 months ago

> but it would involve a new API for loading collections

Which is IMO warranted, anyway, to make it easier to load fonts from a collection on demand, instead of all at once.

I'm currently tinkering with such an API; I can send an RFC PR in a couple of days if you'd like.

Edit: I take that back. The parsing of some tables depends on other tables, which makes it harder to implement table reuse cleanly, as different tables would need different cache keys to encode the dependencies. Being on the "receiving end" of trying to implement it, I'd probably want to see some stats as to how often large tables get reused. My intuition tells me that this is only really the case for CJK fonts with language defaults. Most uses of collections vary fonts by weight, width, slant, etc, which all require unique glyphs.
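To make the cache-key problem concrete: a parsed `hmtx` table can only be reused by another font if that font also uses the same `hhea` (for `numberOfHMetrics`) and the same `maxp` (for `numGlyphs`). One possible shape for such a key is sketched below. The types `span` and `cacheKey`, the function `keyFor`, and the `deps` map are hypothetical, and the dependency list is deliberately incomplete.

```go
package main

import "fmt"

// span is the byte range of one table in the collection file.
type span struct{ Offset, Length uint32 }

// cacheKey identifies a parsed table by its own bytes plus the bytes of
// every table its parser reads. A fixed-size array keeps it comparable.
type cacheKey struct {
	Tag  string
	Self span
	Deps [2]span
}

// deps lists, for a few tables, which other tables the parser needs.
// Illustrative only: no entry may name more than two dependencies here.
var deps = map[string][]string{
	"hmtx": {"hhea", "maxp"}, // numberOfHMetrics, numGlyphs
	"vmtx": {"vhea", "maxp"}, // numOfLongVerMetrics, numGlyphs
	"loca": {"head", "maxp"}, // indexToLocFormat, numGlyphs
}

// keyFor builds the cache key for tag in one font's table directory.
// It reports false when the font has no such table.
func keyFor(tag string, dir map[string]span) (cacheKey, bool) {
	self, ok := dir[tag]
	if !ok {
		return cacheKey{}, false
	}
	k := cacheKey{Tag: tag, Self: self}
	for i, d := range deps[tag] {
		k.Deps[i] = dir[d] // a missing dependency stays the zero span
	}
	return k, true
}

func main() {
	face := map[string]span{
		"hmtx": {Offset: 100, Length: 50},
		"hhea": {Offset: 10, Length: 36},
		"maxp": {Offset: 46, Length: 6},
	}
	k, ok := keyFor("hmtx", face)
	fmt.Println(k, ok)
}
```

Two fonts then hit the same cache entry only when the table and all of its listed dependencies occupy the same bytes in the file, which is the extra bookkeeping that makes a general solution awkward.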

benoitkugler commented 6 months ago

Here are some numbers to illustrate @dominikh's point.


/usr/share/fonts/opentype/noto/NotoSansCJK-Bold.ttc 10 faces
CFF : 16023 KB -> used 10 times
hmtx : 262 KB -> used 10 times
vmtx : 261 KB -> used 10 times
VORG : 0 KB -> used 10 times
BASE : 0 KB -> used 10 times
vhea : 0 KB -> used 10 times
hhea : 0 KB -> used 10 times
post : 0 KB -> used 10 times
GDEF : 0 KB -> used 10 times
maxp : 0 KB -> used 10 times
OS/2 : 0 KB -> used 6 times
OS/2 : 0 KB -> used 4 times
GSUB : 177 KB -> used 2 times
GSUB : 171 KB -> used 2 times
GSUB : 167 KB -> used 2 times
GSUB : 166 KB -> used 2 times
GSUB : 166 KB -> used 2 times

/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc 10 faces
CFF : 15458 KB -> used 10 times
hmtx : 262 KB -> used 10 times
vmtx : 261 KB -> used 10 times
VORG : 0 KB -> used 10 times
BASE : 0 KB -> used 10 times
hhea : 0 KB -> used 10 times
vhea : 0 KB -> used 10 times
post : 0 KB -> used 10 times
GDEF : 0 KB -> used 10 times
maxp : 0 KB -> used 10 times
OS/2 : 0 KB -> used 6 times
OS/2 : 0 KB -> used 4 times
GSUB : 177 KB -> used 2 times
GSUB : 171 KB -> used 2 times
GSUB : 167 KB -> used 2 times
GSUB : 166 KB -> used 2 times
GSUB : 166 KB -> used 2 times

/usr/share/fonts/opentype/noto/NotoSerifCJK-Bold.ttc 5 faces
CFF : 24427 KB -> used 5 times
hmtx : 261 KB -> used 5 times
vmtx : 261 KB -> used 5 times
VORG : 0 KB -> used 5 times
BASE : 0 KB -> used 5 times
hhea : 0 KB -> used 5 times
vhea : 0 KB -> used 5 times
post : 0 KB -> used 5 times
GDEF : 0 KB -> used 5 times
maxp : 0 KB -> used 5 times
OS/2 : 0 KB -> used 3 times
OS/2 : 0 KB -> used 2 times

/usr/share/fonts/opentype/noto/NotoSerifCJK-Regular.ttc 5 faces
CFF : 23442 KB -> used 5 times
hmtx : 261 KB -> used 5 times
vmtx : 261 KB -> used 5 times
VORG : 1 KB -> used 5 times
BASE : 0 KB -> used 5 times
vhea : 0 KB -> used 5 times
hhea : 0 KB -> used 5 times
post : 0 KB -> used 5 times
GDEF : 0 KB -> used 5 times
maxp : 0 KB -> used 5 times
OS/2 : 0 KB -> used 3 times
OS/2 : 0 KB -> used 2 times

(I've not found any other collections on my system though.)

Perhaps a first step would be to consider only the CFF, CFF2, and glyf tables (which are by far the heaviest)?
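As a rough check of how much such a first step would capture, the NotoSansCJK-Bold.ttc numbers from the listing above can be plugged into a small calculation: a table of size S used N times is loaded redundantly S × (N − 1) times over. The sketch below is only arithmetic on the rounded KB figures already reported. The names `usage`, `heavy`, and `redundantKB` are hypothetical, and the tags are written as they appear in the listing.

```go
package main

import "fmt"

// usage is one line of the listing: a table, its size, and its use count.
type usage struct {
	Tag    string
	SizeKB int
	Uses   int
}

// heavy is the set of outline tables proposed as a first step.
var heavy = map[string]bool{"CFF": true, "CFF2": true, "glyf": true}

// notoSansCJKBold copies the non-zero rows reported for NotoSansCJK-Bold.ttc.
var notoSansCJKBold = []usage{
	{"CFF", 16023, 10},
	{"hmtx", 262, 10},
	{"vmtx", 261, 10},
	{"GSUB", 177, 2},
	{"GSUB", 171, 2},
	{"GSUB", 167, 2},
	{"GSUB", 166, 2},
	{"GSUB", 166, 2},
}

// redundantKB returns how many KB are loaded more than once when every
// face is parsed independently: in total, and for heavy tables only.
func redundantKB(stats []usage) (total, heavyOnly int) {
	for _, u := range stats {
		if u.Uses < 2 {
			continue // a table used once is never loaded redundantly
		}
		extra := u.SizeKB * (u.Uses - 1)
		total += extra
		if heavy[u.Tag] {
			heavyOnly += extra
		}
	}
	return total, heavyOnly
}

func main() {
	total, heavyOnly := redundantKB(notoSansCJKBold)
	fmt.Printf("redundant: %d KB, of which outline tables: %d KB (%.1f%%)\n",
		total, heavyOnly, 100*float64(heavyOnly)/float64(total))
}
```

For this file the CFF table accounts for 144207 KB of the 149761 KB loaded redundantly, so caching the outline tables alone would recover almost all of the possible savings.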

andydotxyz commented 6 months ago

> Which is IMO warranted, anyway, to make it easier to load fonts from a collection on demand, instead of all at once.

I appreciate that this is complex, but I agree that a collection-based API may be a good thing, so that we can lazily load less than a full collection.

I recently found that many operating systems ship all languages in a single file, including the glyphs for every script, which makes for big files that are not particularly fast to parse.
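One way such an on-demand API could look is sketched below, purely as an illustration of the idea discussed in this thread. Faces are loaded one at a time, and only the heavy outline tables are cached, keyed by their byte range. `Collection`, `Face`, `Outlines`, and `tableRef` are hypothetical names, not the go-text API, and "parsing" is reduced to slicing the raw bytes.

```go
package main

import (
	"fmt"
	"sync"
)

// tableRef identifies a table by its tag and byte range in the file.
type tableRef struct {
	Tag            string
	Offset, Length uint32
}

// Outlines stands in for a parsed CFF, CFF2 or glyf table.
type Outlines struct{ Raw []byte }

// Face is one font loaded from the collection.
type Face struct {
	Index    int
	Outlines *Outlines
}

// heavy lists the tables worth sharing between faces.
var heavy = map[string]bool{"CFF ": true, "CFF2": true, "glyf": true}

// Collection loads faces on demand and shares heavy tables between them.
type Collection struct {
	data  []byte
	faces [][]tableRef // one table directory per face

	mu       sync.Mutex
	outlines map[tableRef]*Outlines
	parses   int // number of outline tables actually parsed
}

// Face loads face i, reusing an outline table already parsed for another
// face when both refer to the same bytes.
func (c *Collection) Face(i int) (*Face, error) {
	if i < 0 || i >= len(c.faces) {
		return nil, fmt.Errorf("face index %d out of range", i)
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.outlines == nil {
		c.outlines = make(map[tableRef]*Outlines)
	}
	f := &Face{Index: i}
	for _, ref := range c.faces[i] {
		if !heavy[ref.Tag] {
			continue // small tables would be parsed per face
		}
		o, ok := c.outlines[ref]
		if !ok {
			end := uint64(ref.Offset) + uint64(ref.Length)
			if end > uint64(len(c.data)) {
				return nil, fmt.Errorf("table %q out of bounds", ref.Tag)
			}
			o = &Outlines{Raw: c.data[ref.Offset:end]}
			c.outlines[ref] = o
			c.parses++
		}
		f.Outlines = o
	}
	return f, nil
}

func main() {
	shared := tableRef{"CFF ", 0, 4}
	c := &Collection{
		data:  []byte{0, 1, 2, 3},
		faces: [][]tableRef{{shared}, {shared}},
	}
	for i := range c.faces {
		if _, err := c.Face(i); err != nil {
			fmt.Println(err)
			return
		}
	}
	fmt.Println("outline tables parsed:", c.parses)
}
```

Restricting the cache to outline tables sidesteps most of the dependency problem raised earlier, at the cost of still re-parsing the small tables for every face.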

go-text / typesetting

ParseTTC duplicates work for tables shared between fonts #147