generated font file contains erroneous cmap entries resulting in blank ascii characters

reticivis-net commented 1 year ago

I'm using nanoemoji to build a COLR0 font from Twitter's Twemoji. The generated font file contains cmap entries that map the raw ASCII digits to blank glyphs. This (probably) isn't an issue with fontTools, as I wrote a script using fontTools to fix it. My guess is that for the number emoji, it detects the initial unicode codepoint being a number, and adds an entry for it. The number emoji are the ASCII digits followed by U+20E3.

import glob
import subprocess

from fontTools.ttLib import TTFont

# this code generates the font file from twemoji
proc = subprocess.run(["nanoemoji",
                       "--family", "Twemoji Color Emoji",
                       "--color_format", "cff_colr_0",
                       "--version_major", "1",
                       "--version_minor", "1",
                       "--output_file", "TwemojiCOLR0.otf",
                       # only works on linux due to length limit
                       *(glob.glob("./twemoji/assets/svg/*"))
                       ])

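# this code fixes it: drop every cmap entry below codepoint 100
# (that covers '#', '*' and the digits 0-9)
if proc.returncode == 0:
    print("Fixing cmap...")
    # load the font that nanoemoji generated
    font = TTFont("build/TwemojiCOLR0.otf")
    for table in font["cmap"].tables:
        # copy the keys first so entries can be deleted while iterating
        for codepoint in list(table.cmap):
            if codepoint < 100:
                del table.cmap[codepoint]

    font.save("TwemojiCOLR0.otf")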

anthrotype commented 1 year ago

hello, thanks for using our tools! I suspect this may have to do with the way the input svg files are named. nanoemoji attempts to parse the filename to come up with the unicode sequences that are then encoded either in cmap (for single unicode codepoints) or in GSUB ccmp feature for ligature-style sequences involving multiple codepoints.

https://github.com/googlefonts/nanoemoji/blob/2ffdd23faad153c5cec763b6e210befc9d704cd1/src/nanoemoji/codepoints.py#L22-L26
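
To make the filename question concrete, here is a minimal sketch of how a filename could be turned into a codepoint sequence. It is not nanoemoji's actual parser: `codepoints_from_filename` and `is_single_codepoint` are hypothetical names, and the only assumptions are the two naming schemes mentioned in this thread (Twemoji's dash-separated hex and Noto's `emoji_u` prefix with underscores).

```python
import re
from pathlib import Path
from typing import Tuple

# Hypothetical helper for illustration; not nanoemoji's actual parser.
_NOTO_PREFIX = "emoji_u"


def codepoints_from_filename(filename: str) -> Tuple[int, ...]:
    """Parse e.g. '30-20e3.svg' or 'emoji_u0030_20e3.svg' into (0x30, 0x20E3)."""
    stem = Path(filename).stem
    if stem.startswith(_NOTO_PREFIX):
        stem = stem[len(_NOTO_PREFIX):]
    # Twemoji separates codepoints with '-', Noto with '_'.
    return tuple(int(part, 16) for part in re.split(r"[-_]", stem) if part)


def is_single_codepoint(codepoints: Tuple[int, ...]) -> bool:
    """Single codepoints would be encoded in cmap; longer sequences in GSUB."""
    return len(codepoints) == 1
```

Under this sketch a keycap file such as `30-20e3.svg` parses as a two-codepoint sequence, so it would take the ligature path rather than a direct cmap entry.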

What are the offending svg files named?

anthrotype commented 1 year ago

you know, to work around the shell limits on the number of arguments, you can write a .toml file and pass that to nanoemoji as the input (instead of a list of .svg files) and it will use that to find the input sources and set all the other flags.

E.g. https://github.com/googlefonts/color-fonts/blob/main/config/twemoji_smiley-glyf_colr_1.toml

reticivis-net commented 1 year ago

I saw that bit in the codepoints file trying to debug it myself earlier. The thing is that I tested the regex on the twemoji scheme (codepoints separated by dashes) and it worked fine. I even used PowerRename on windows to change the files into noto-like filenames (emoji_u then codepoints separated by underscores) and the generated file still needed fixing. Both generated files displayed compound(?) emojis just fine so it's able to parse the codepoints just fine.

anthrotype commented 1 year ago

> My guess is that for the number emoji, it detects the initial unicode codepoint being a number, and adds an entry for it.

I think you're correct. There's a method which adds empty glyphs for all the codepoints that are used in the emoji sequences:

https://github.com/googlefonts/nanoemoji/blob/2ffdd23faad153c5cec763b6e210befc9d704cd1/src/nanoemoji/write_font.py#L688-L713
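
Roughly, the idea can be sketched like this. This is a simplified illustration, not the linked method: `component_codepoints` is a hypothetical name, and it only computes which codepoints appear inside multi-codepoint sequences without having an emoji of their own.

```python
from typing import Iterable, List, Tuple


def component_codepoints(sequences: Iterable[Tuple[int, ...]]) -> List[int]:
    """Codepoints used inside sequences that have no single-codepoint glyph."""
    sequences = list(sequences)
    # Codepoints that already get a real glyph of their own.
    standalone = {seq[0] for seq in sequences if len(seq) == 1}
    # Every codepoint that takes part in a multi-codepoint sequence.
    used = {cp for seq in sequences if len(seq) > 1 for cp in seq}
    # These are the ones that would need an (empty) placeholder glyph.
    return sorted(used - standalone)


emoji = [(0x1F600,), (0x30, 0x20E3), (0x31, 0x20E3)]
placeholders = component_codepoints(emoji)
```

For the keycap sequences, the result contains the ASCII digits and U+20E3, which matches the extra cmap entries reported above.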

I think we did that to avoid showing tofus for incomplete sequences -- though I don't know what exactly would happen if we didn't do that /cc @rsheeter maybe knows/remembers why

But what's the real issue if you have those additional mappings? Are you worried about the extra bytes?

reticivis-net commented 1 year ago

The problem is that if Twemoji is one of several fonts in a fallback list in, for example, Pango, numbers come out as blank characters instead of falling back to the fonts below it. You could just prioritize another font over Twemoji, but the other font might have some emojis you don't want, or maybe you are only specifying Twemoji and want numbers to fall back to the system font. Regardless, it's unexpected behavior that is best avoided imo
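
A toy model of per-character fallback shows the effect. This is only a sketch under the assumption that the renderer picks the first font in the list whose cmap covers the character; it is not Pango's real algorithm, and the cmaps and glyph names below are made up.

```python
from typing import Dict, List, Optional, Tuple

# A font chain as (name, cmap) pairs; cmap maps codepoint -> glyph name.
FontChain = List[Tuple[str, Dict[int, str]]]


def pick_font(char: str, chain: FontChain) -> Optional[str]:
    """Return the first font in the chain whose cmap covers the character."""
    for name, cmap in chain:
        if ord(char) in cmap:
            return name
    return None


# Made-up cmaps: Twemoji maps the digit 0 to a blank placeholder glyph.
twemoji = {0x30: "u0030_blank", 0x20E3: "u20E3", 0x1F600: "u1F600"}
system = {0x30: "zero", 0x41: "A"}
chain = [("Twemoji", twemoji), ("System", system)]

# Same chain after stripping codepoints < 100 from Twemoji's cmap.
stripped = {cp: glyph for cp, glyph in twemoji.items() if cp >= 100}
stripped_chain = [("Twemoji", stripped), ("System", system)]
```

With the untouched cmap the digit is claimed by Twemoji and drawn blank; with the stripped cmap it falls through to the system font.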

rsheeter commented 1 year ago

Good catch, I think this is my fault and a bug.

anthrotype commented 1 year ago

Actually, thinking about it a bit more, I know why we need to do that. Because emoji sequences are implemented as GSUB ligatures, the font needs to have these cmap mappings to the individual glyphs that make up the ligatures (whether the glyph contours are empty or not), in order to be able to substitute a sequence of glyphs with a single "ligature" glyph. The "G" in GSUB stands for glyph: it substitutes glyphs, not characters. E.g. one can't have an f_i ligature without also having the glyphs f and i, which in turn are mapped to unicode codepoints 0x66 and 0x69 in the cmap. When shaping text, cmap is applied first and then GSUB; the latter by itself can't do anything. Each component of a GSUB ligature substitution is a glyph that must either be encoded directly via cmap or, if unencoded, be produced by some other GSUB rule that traces it back to some cmapped glyph at some point; otherwise the ligature can never be activated or typed from a keyboard.
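
The cmap-then-GSUB order can be illustrated with a toy shaper. This is a deliberately simplified sketch, not how a real shaping engine is implemented: `shape` is a hypothetical function, glyph names are made up, and the keycap sequence is reduced to digit + U+20E3 (the variation selector is left out).

```python
from typing import Dict, List, Tuple


def shape(text: str, cmap: Dict[int, str],
          ligatures: Dict[Tuple[str, ...], str]) -> List[str]:
    """Map characters to glyphs via cmap, then apply ligature substitutions."""
    # Step 1: cmap. Unmapped characters become .notdef.
    glyphs = [cmap.get(ord(ch), ".notdef") for ch in text]
    # Step 2: GSUB-like ligatures, which only ever see glyph names.
    out: List[str] = []
    i = 0
    while i < len(glyphs):
        for components, ligature in ligatures.items():
            if tuple(glyphs[i:i + len(components)]) == components:
                out.append(ligature)
                i += len(components)
                break
        else:
            out.append(glyphs[i])
            i += 1
    return out


ligatures = {("u0030", "u20E3"): "keycap_zero"}
full_cmap = {0x30: "u0030", 0x20E3: "u20E3"}
stripped_cmap = {0x20E3: "u20E3"}  # digit entry removed
```

With `full_cmap` the sequence collapses into the single keycap glyph. With `stripped_cmap` the digit never becomes the `u0030` glyph, so the ligature rule has nothing to match.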

anthrotype commented 1 year ago

Ok, I just made two test Twemoji fonts, one the same as yours, with all codepoints < 100 removed from the cmap table after build, and another font with the cmap untouched, thus containing explicit mappings for the ASCII numbers. You can find it in this zip file, which also includes an html file that compares both fonts side by side with a string that includes the KeyCaps emoji, i.e. #️⃣*️⃣0️⃣1️⃣2️⃣3️⃣4️⃣5️⃣6️⃣7️⃣8️⃣9️⃣

Twemoji-COLRv0.zip

You can see in this screenshot below that the top line, with the original font built with nanoemoji, displays Twemoji's keycap glyphs correctly, whereas the bottom line, with the font stripped of ASCII number cmappings, falls back to the system's emoji (in my ChromeOS case this is NotoColorEmoji) because the browser (Chrome) can't compose the keycap sequences using the Twemoji-COLRv0-no-colors.otf font, and thus resorts to the fallback system font:

Screenshot 2022-09-23 19 23 03

So your stripping those entries from cmap makes it impossible to compose those sequences. I think this is intended behavior and we don't want to change that. I actually don't know what best way would be to make this work with font fallback chains.

reticivis-net commented 1 year ago

Ah that makes sense. I suppose putting the emoji font as the fallback font would work and if needed I can just strip emojis from the top font

rsheeter commented 1 year ago

> I actually don't know what best way would be to make this work with font fallback chains.

In looking at the Android chain the solution appears to be putting it quite late in the chain plus, roughly, matching on the whole grapheme.
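
As a rough sketch of that idea (an assumption about the general approach, not Android's actual implementation), fallback can be modelled as choosing a font per grapheme cluster instead of per character, with the emoji font placed after the text font. The function name and cmaps below are hypothetical, and the keycap cluster is again simplified to digit + U+20E3.

```python
from typing import Dict, List, Optional, Tuple

FontChain = List[Tuple[str, Dict[int, str]]]


def pick_font_for_cluster(cluster: str, chain: FontChain) -> Optional[str]:
    """Return the first font that covers every codepoint in the cluster."""
    for name, cmap in chain:
        if all(ord(ch) in cmap for ch in cluster):
            return name
    return None


# Made-up cmaps; the text font is assumed to lack U+20E3.
text_font = {0x30: "zero", 0x41: "A"}
emoji_font = {0x30: "u0030_blank", 0x20E3: "u20E3"}
chain = [("Text", text_font), ("Twemoji", emoji_font)]
```

A bare digit is taken by the text font, while the whole keycap cluster goes to the emoji font, so the blank placeholder glyph is never shown on its own.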

I believe that means there's nothing to change in nanoemoji, please reopen if I misunderstood. I'm glad @anthrotype is here, I'd about convinced myself we had a bug.

googlefonts / nanoemoji

generated font file contains erroneous cmap entries resulting in blank ascii characters #436