Possible "all other codepoints" semantic?

skef commented 1 month ago

I've been thinking about various segmentation patterns one might want to support. Suppose you have a website that supports documentation in 15 "locales", which boil down to, say, 11 scripts. Some pages have text with more than one script. So you decide to segment the codepoints for the dependent patches into 12 buckets: one for each of those 11 scripts and a 12th for "other".

Then in the initial font you want an arrangement like this:

1. 11 patches, one for each of the scripts
2. <110 patches for the combinations of any two scripts (too lazy to work through the math of removing the functional duplicates)
3. 1 fallback patch that adds "everything not in the initial font"

Obviously if you patch with 3 then there's nothing else to add. If you patch with one of the 1s then it will replace the map with one that has 10 patches, one for each additional script, and an 11th to add "everything not at the current patch level".
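For a sense of scale, the arrangement above can be counted with a short sketch. The script names are placeholders, and the sketch assumes that removing "functional duplicates" means treating a pair of scripts as the same patch regardless of order; neither is taken from the spec.

```python
from itertools import combinations

# Hypothetical script tags standing in for the 11 scripts in the example.
SCRIPTS = [f"script{i}" for i in range(1, 12)]


def initial_patch_map(scripts):
    """Enumerate the patch-map entries of the initial font (illustrative only)."""
    singles = [frozenset([s]) for s in scripts]
    # Assumption: a two-script patch is the same patch in either order.
    pairs = [frozenset(p) for p in combinations(scripts, 2)]
    fallback = ["everything not in the initial font"]
    return singles, pairs, fallback


def map_after_single(scripts, applied):
    """Entries in the replacement map after applying one single-script patch."""
    remaining = [s for s in scripts if s != applied]
    return remaining + ["everything not at the current patch level"]


singles, pairs, fallback = initial_patch_map(SCRIPTS)
print(len(singles), len(pairs), len(fallback))    # 11 55 1
print(len(map_after_single(SCRIPTS, "script1")))  # 11
```

Under that assumption the "<110" figure works out to 55 pair patches, for 67 entries in the initial map.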

This seems like an attractive arrangement because it always means at most an additional round trip for getting the patches, which you'll need anyway if you're doing glyph-keyed patches (and even if you aren't it's still an attractive property).

What this makes me wonder is: Should there be a simple, compact way in the patch map to express "all other supported codepoints", instead of having to list them all out?

The immediate drawback of such a thing is that if the client needs to support extra codepoint x, it can't tell whether x is supported in the font, so it will load the fallback patch anyway. Or, rather, it can't tell whether x is supported if you're not also doing glyph-keyed patches. If you are doing glyph-keyed patches, then the client can look at the set of codepoints in the table for those patches and determine whether x is supported, and only load the "everything else" dependent patch if it's loading a glyph-keyed patch with an "extra" codepoint. (Maybe you'd want to require that the mapped glyphs in glyph-keyed patches respect that "boundary" -- not grouping glyphs from different dependent segments in the same bin -- and maybe not, I haven't thought that through.)

If you aren't doing glyph-keyed patches then you wouldn't be able to tell, and maybe that's fine. The encoder can decide if the more compact encoding is worth the cost of sometimes loading a patch unnecessarily.
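The client-side decision described in the last two paragraphs might look roughly like the sketch below. The function name and its parameters are hypothetical and only model the reasoning; they are not part of any IFT client API.

```python
def should_load_fallback(needed, script_segments, glyph_keyed_codepoints=None):
    """Decide whether to load the "all other codepoints" patch.

    needed: set of codepoints the client has to render.
    script_segments: codepoint sets listed explicitly in the patch map.
    glyph_keyed_codepoints: codepoints listed in the glyph-keyed patch table,
        or None when the font has no glyph-keyed patches.
    """
    listed = set().union(*script_segments)
    extra = needed - listed
    if not extra:
        return False  # everything needed is covered by explicit segments
    if glyph_keyed_codepoints is None:
        # No way to tell whether the extra codepoints are supported,
        # so the fallback is loaded, possibly unnecessarily.
        return True
    # With glyph-keyed data, load only if an extra codepoint is supported.
    return bool(extra & glyph_keyed_codepoints)


segments = [{0x41, 0x42}, {0x3B1}]
print(should_load_fallback({0x41, 0x4E00}, segments))      # no glyph-keyed data
print(should_load_fallback({0x4E00}, segments, {0x5D0}))   # unsupported extra
```

The first call returns `True` because the client cannot rule out support for U+4E00; the second returns `False` because the glyph-keyed table shows it is unsupported.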

Anyway, just some thoughts I thought we might want to discuss ...

garretrieger commented 1 month ago

In format 2 you can fairly compactly represent the 3) patch by using the copy indices mechanism. In this example you would construct a mapping entry that copies the codepoint sets from all of the 1) entries and then includes a small additional set of codepoints for anything in the original font but not covered by those.

I think that is probably sufficiently compact for the situation you've described here and won't suffer from the drawback of matching codepoints that are not in the original font. If, for a particular situation, you're not worried about accidentally matching codepoints not in the original font, then it is currently possible to construct an entry which matches all codepoints (via format 2) by not specifying a codepoint set in the entry.
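As a rough model of the two options above (not the binary format 2 encoding), the sketch below treats an entry as an optional codepoint set plus a list of copy indices. The `Entry` class and `matches` function are illustrative names, and the sketch assumes copied entries carry explicit codepoint sets.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    codepoints: Optional[frozenset] = None  # None: no codepoint set specified
    copy_indices: list = field(default_factory=list)


def matches(entries, i, codepoint):
    """Return True if entry i matches the codepoint (simplified model)."""
    e = entries[i]
    if e.codepoints is None and not e.copy_indices:
        return True  # no set specified: matches any codepoint
    covered = set(e.codepoints or ())
    for j in e.copy_indices:
        # Assumption: only the directly listed set of entry j is copied.
        covered |= entries[j].codepoints or frozenset()
    return codepoint in covered


entries = [
    Entry(frozenset({0x41})),                           # a 1) entry
    Entry(frozenset({0x3B1})),                          # another 1) entry
    Entry(frozenset({0x2603}), copy_indices=[0, 1]),    # 3) via copy indices
    Entry(),                                            # 3) with no set at all
]
print(matches(entries, 2, 0x4E00), matches(entries, 3, 0x4E00))
```

Entry 2 only stores the one leftover codepoint itself and does not match U+4E00, while entry 3 stores nothing and matches it even though the font does not contain it.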

skef commented 1 month ago

> copies the codepoint sets from the all 1) entries and then include a small additional set of codepoints for anything in the original font but not covered by those.

I'm thinking about large fonts with wide coverage, e.g. a Noto font. What if the number of codepoints outside of the 11 specific locales is 3 times the number they contain together? Or what if there are just 3 locales of interest, and there are 10 times more code points outside of those?

Some of our discussion seems to implicitly assume a roughly symmetric segmentation, and that might not always reflect the relevant use cases.

garretrieger commented 1 month ago

For what I proposed the relative sizes of the segments aren't important. Assuming you don't want to match characters not in the original font, it's going to be necessary to encode somewhere all of the codepoints found in the original font. Then the goal is just to ensure we're not repeating the encoding of the same codepoint in multiple segments, which the copy mechanism solves. The specific distribution of codepoints across segments, including the catch-all, shouldn't make a significant difference to the total encoded size, which will roughly be a function of the total number of codepoints in the original font.

If you have a catch-all segment which contains a very large number of codepoints and want to avoid encoding those, then a catch-all entry with no codepoint set specified can be used, which will match all Unicode codepoints. However, this comes at the cost of triggering the catch-all load if codepoints not found in the original font are encountered. For some cases, like a Noto font that intends to cover all or nearly all of Unicode, this is probably reasonable.

One missing piece of functionality that I'm seeing is a subtraction mechanism (e.g. specifying that the entry should match all codepoints minus a set of codepoints from other segments). For example, let's say we are encoding a Noto case which covers all of Unicode and want to split the font into three segments A, B, and C, where C is much larger than A and B.

This could be optimally encoded (using subtraction) in three segments at the cost of encoding only A and B:

1. A
2. B
3. * - A - B

If we do want the subtraction mechanism, I think it could be added by using the most significant bit (MSB) of the copy index to indicate subtraction instead of addition for that entry index, dropping the entry index value down to 23 bits.
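A minimal sketch of that proposal follows, assuming a copy index is a 24-bit value (inferred from the "23 bits" remark) and that subtraction takes precedence over addition. All names are hypothetical, and none of this is existing spec behaviour.

```python
SUBTRACT_BIT = 1 << 23         # proposed: MSB of the 24-bit copy index
INDEX_MASK = SUBTRACT_BIT - 1  # the remaining 23 bits hold the entry index


def encode_copy_index(entry_index, subtract=False):
    """Pack an entry index and the proposed subtraction flag into 24 bits."""
    if not 0 <= entry_index <= INDEX_MASK:
        raise ValueError("entry index does not fit in 23 bits")
    return entry_index | (SUBTRACT_BIT if subtract else 0)


def decode_copy_index(value):
    """Return (entry_index, subtract) for a packed copy index."""
    return value & INDEX_MASK, bool(value & SUBTRACT_BIT)


def entry_matches(codepoint, segments, copy_values, own=None):
    """Evaluate (own | added) - subtracted; own=None models '*' (match all)."""
    added = False
    for value in copy_values:
        index, subtract = decode_copy_index(value)
        if codepoint in segments[index]:
            if subtract:
                return False  # assumption: subtraction always wins
            added = True
    return added or own is None or codepoint in own


A, B = {0x41}, {0x3B1}
# Segment C from the example: "* - A - B", encoded without listing C itself.
c_copies = [encode_copy_index(0, subtract=True), encode_copy_index(1, subtract=True)]
print(entry_matches(0x4E00, [A, B], c_copies), entry_matches(0x41, [A, B], c_copies))
```

Here segment C costs two copy indices to encode, whatever its actual size.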

skef commented 1 month ago

If I'm understanding right I think there might be two things to consider:

1. Should there be some sort of subtraction mechanism?
2. Should the patch "referring" mechanism optionally work between tables (IFT vs IFTX)? This would allow, for example, the dependent patch codepoints to be built in reference to sets of glyph-keyed patches.

garretrieger commented 2 weeks ago

At the working group call today I mentioned a possible change to format 1 to take advantage of the fully present cmap in order to map a group of remaining codepoints to an entry. Here's how that could be done:

In GlyphMap add two new fields prior to entryIndex:

mappedGlyphCount (uint24)
defaultEntryIndex (uint8/uint16)

Then the mapping would work like so:

entryIndex[] has length mappedGlyphCount and provides a gid -> entry mapping for glyphs from firstMappedGlyph to firstMappedGlyph + mappedGlyphCount - 1.
So then we have the following mappings for glyphs:

0..firstMappedGlyph - 1 map to entry 0
firstMappedGlyph..firstMappedGlyph + mappedGlyphCount - 1 map based on entryIndex[]
firstMappedGlyph + mappedGlyphCount..glyphCount - 1 map to defaultEntryIndex

What do you think?
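The three ranges in the proposal above can be sketched as a lookup function. The snake_case parameters mirror the proposed fields, with mappedGlyphCount taken as the length of `entry_index`; this models the suggested behaviour only and is not the current format 1 definition.

```python
def entry_for_glyph(gid, glyph_count, first_mapped_glyph, entry_index,
                    default_entry_index):
    """Map a glyph id to an entry index under the proposed GlyphMap layout."""
    mapped_glyph_count = len(entry_index)
    if not 0 <= gid < glyph_count:
        raise ValueError("glyph id out of range")
    if gid < first_mapped_glyph:
        return 0  # glyphs before firstMappedGlyph map to entry 0
    offset = gid - first_mapped_glyph
    if offset < mapped_glyph_count:
        return entry_index[offset]  # explicitly mapped range
    return default_entry_index      # all remaining glyphs


# 10 glyphs: 0-1 map to entry 0, 2-4 use entryIndex[], 5-9 use the default.
table = dict(glyph_count=10, first_mapped_glyph=2,
             entry_index=[1, 1, 2], default_entry_index=3)
print([entry_for_glyph(g, **table) for g in range(10)])
```

With this layout the trailing glyphs cost nothing beyond the single defaultEntryIndex field.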

w3c / IFT

Possible "all other codepoints" semantic? #192