handling of variation selectors in Unicode RGI emoji sequences

For historical reasons (Why is the sky blue? to be answered elsewhere), Unicode defines standardized variation sequences for many RGI emoji. Some are just a base character followed by a variation selector (VS), such as <2640 FE0F> for the female sign emoji (♀️). In others the base + VS pair is part of a longer emoji ZWJ sequence, such as <26F9 1F3FB 200D 2640 FE0F> for the woman bouncing ball emoji (⛹🏻‍♀️).

The full set of standardized variation sequences for emoji is defined in a Unicode Character Database data file: https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-variation-sequences.txt.
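As a rough illustration of how that data file could be consumed, here is a minimal sketch that collects the base code points that have an "emoji style" (FE0F) sequence. The function name and the sample excerpt are my own; the excerpt follows the layout I understand the file to use (code points, a semicolon-separated style field, then a `#` comment), so treat that layout as an assumption and check it against the real file.

```python
def parse_emoji_variation_sequences(lines):
    """Return the set of base code points that have a <base FE0F> sequence.

    Hypothetical helper: expects lines shaped like
    '2640 FE0F  ; emoji style; # (1.1) FEMALE SIGN'.
    """
    bases = set()
    for raw in lines:
        # Drop trailing comments and skip blank or comment-only lines.
        data = raw.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        codepoints = [int(cp, 16) for cp in fields[0].split()]
        # Keep only the emoji-presentation sequences (base + FE0F).
        if len(codepoints) == 2 and codepoints[1] == 0xFE0F:
            bases.add(codepoints[0])
    return bases


# Assumed excerpt, for illustration only.
SAMPLE = """\
# emoji-variation-sequences.txt (excerpt)
0023 FE0E  ; text style;  # (1.1) NUMBER SIGN
0023 FE0F  ; emoji style; # (1.1) NUMBER SIGN
2640 FE0E  ; text style;  # (1.1) FEMALE SIGN
2640 FE0F  ; emoji style; # (1.1) FEMALE SIGN
"""

emoji_bases = parse_emoji_variation_sequences(SAMPLE.splitlines())
print(sorted(f"{cp:04X}" for cp in emoji_bases))  # ['0023', '2640']
```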

The full set of Unicode RGI non-ZWJ sequences is defined in a UTS #51 data file: https://www.unicode.org/Public/emoji/15.0/emoji-sequences.txt. The full set of RGI emoji ZWJ sequences is defined in another UTS #51 data file: https://www.unicode.org/Public/emoji/15.0/emoji-zwj-sequences.txt.

In the workflow for developing an emoji font, someone could name a set of SVG assets using the Unicode RGI sequences, resulting in many of the asset names including "_FE0F" as an element. Currently, nanoemoji would assume that FE0F is just another Unicode character that gets a cmap entry and then gets used within GSUB ligature substitutions. That could lead to problems, however. While the variation selector FE0F is included in many Unicode RGI sequences, it is actually optional: Unicode allows VSs to be used to select between "emoji" (full colour) and "text" (monochrome) presentation, but in an emoji font full colour can be assumed even without the variation selector. Thus, using U+2640 as an example, either <2640> or <2640 FE0F> should result in the same glyph ID, and that glyph ID should be a base GID within the COLR table. Instead, nanoemoji will generate a GSUB ligature substitution for the latter case, which results in a different GID for <2640 FE0F> than for <2640>.
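To make the intended equivalence concrete, here is a small sketch that treats a sequence with and without FE0F as the same key. The `2640_FE0F` style of naming and both helper functions are hypothetical, chosen only to mirror the "_FE0F" element mentioned above; this is not how nanoemoji currently parses asset names.

```python
VS16 = 0xFE0F  # emoji presentation selector


def parse_sequence(name):
    """Parse a hypothetical underscore-separated hex name, e.g. '2640_FE0F'."""
    return tuple(int(part, 16) for part in name.split("_"))


def strip_vs16(codepoints):
    """Drop every FE0F so <2640> and <2640 FE0F> share one key."""
    return tuple(cp for cp in codepoints if cp != VS16)


female_sign = parse_sequence("2640")
female_sign_vs = parse_sequence("2640_FE0F")
# Both spellings should end up pointing at the same glyph.
same_key = strip_vs16(female_sign_vs) == strip_vs16(female_sign)

zwj_sequence = parse_sequence("26F9_1F3FB_200D_2640_FE0F")
print(same_key, [f"{cp:04X}" for cp in strip_vs16(zwj_sequence)])
```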

Problems could be avoided if a font developer makes sure not to include "FE0F" in the names of any input assets, and then uses a post-process to add a 0/5/14 cmap subtable (platform 0, encoding 5, format 14: Unicode variation sequences). That could be done using the emoji-variation-sequences.txt data file as an input.
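A post-process along those lines might look roughly like the sketch below. It assumes fontTools is available and that the font does not already contain a format 14 subtable (merging is not handled). The function names are hypothetical. Each <base FE0F> pair is recorded as a default UVS entry (glyph name `None`), which tells the client to use whatever glyph the regular cmap already maps the base to.

```python
VS16 = 0xFE0F  # emoji presentation selector


def build_uvs_dict(cmap, emoji_bases):
    """Build the {selector: [(base, glyph_or_None), ...]} mapping.

    Only bases the font actually maps are included. None marks a
    default UVS: <base FE0F> resolves to the base's regular cmap glyph.
    """
    entries = [(base, None) for base in sorted(emoji_bases) if base in cmap]
    return {VS16: entries} if entries else {}


def add_format14_subtable(font, emoji_bases):
    """Append a 0/5/14 cmap subtable to a fontTools TTFont (sketch only)."""
    # Imported lazily so build_uvs_dict can be used without fontTools.
    from fontTools.ttLib.tables._c_m_a_p import CmapSubtable

    uvs_dict = build_uvs_dict(font.getBestCmap(), emoji_bases)
    if not uvs_dict:
        return
    subtable = CmapSubtable.newSubtable(14)
    subtable.platformID = 0
    subtable.platEncID = 5
    subtable.language = 0
    subtable.cmap = {}  # format 14 carries no direct code point mappings
    subtable.uvsDict = uvs_dict
    font["cmap"].tables.append(subtable)


# Toy cmap for illustration: U+0023 is not in the font, so it is skipped.
toy_cmap = {0x2640: "u2640", 0x26F9: "u26F9"}
toy_uvs = build_uvs_dict(toy_cmap, {0x2640, 0x0023})
print(toy_uvs)
```

In a real run, `emoji_bases` would come from emoji-variation-sequences.txt, and `add_format14_subtable` would be called on the built font before saving it.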

That functionality could also be built into nanoemoji directly.

A potential alternative in nanoemoji would be to ensure that ligature substitutions of <base FE0F> produce the same glyph ID as baseGID rather than a different glyph ID. But variation selector characters are default ignorable in Unicode, so some layout implementations might suppress variation selectors from the glyph sequence. Adding the 0/5/14 cmap subtable is a more robust approach.


handling of variation selectors in Unicode RGI emoji sequences #449