satbyy / go-noto-universal

Noto fonts go universal! Download pan-Unicode, merged Noto fonts according to time of usage (current, ancient) or geographical region (South Asia, SE Asia, Africa-MiddleEast, Europe-Americas).

cmap overflow? cannot merge CJK with GoNotoContemporary.ttf #16

Closed satbyy closed 2 years ago

satbyy commented 2 years ago

Latest CI builds create GoNotoContemporary.ttf which is a superset of all regional fonts (Asia + Africa + Europe + Americas), excluding historical scripts (and sign-writing). The only missing region is East Asia, aka CJK.

It has plenty of room for expansion: as of now it covers 11706 codepoints and 34256 glyphs.

otfinfo -u GoNotoContemporary.ttf | wc -l
11706
otfinfo -g GoNotoContemporary.ttf | wc -l
34256

So there is room for 65535 - 34256 = 31279 more glyphs (roughly 30K) before we max out the 65535 glyph limit.

We also generate GoNotoCJKCore.otf, which has about 10K code points and 20K glyphs, so it should all fit nicely in the same font.
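
As a rough sanity check of that glyph budget, here is a minimal Python sketch. It uses only the counts quoted in this issue, takes the "20K" CJK figure as an approximation, and assumes the merge simply adds the two glyph counts together (glyphs shared by both fonts are ignored).

```python
# Numbers quoted in this issue; the CJK count is only the rough "20K" estimate.
MAX_GLYPHS = 65535            # glyph limit mentioned above
contemporary_glyphs = 34256   # otfinfo -g GoNotoContemporary.ttf | wc -l
cjk_core_glyphs = 20_000      # approximate, from "about 20K glyphs"

# Room left in GoNotoContemporary.ttf before the glyph limit is reached.
headroom = MAX_GLYPHS - contemporary_glyphs

# Assumption: merging adds the counts without de-duplicating any glyphs.
merged_estimate = contemporary_glyphs + cjk_core_glyphs

print(f"headroom before merge: {headroom}")
print(f"estimated merged glyphs: {merged_estimate}")
print(f"fits under the glyph limit: {merged_estimate <= MAX_GLYPHS}")
```

So the glyph count alone is not what blocks the merge.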

However, the idea fails because the cmap format 4 subtable hits its 65535-byte limit: its length field is an unsigned 16-bit integer.

otf2ttf GoNotoCJKCore2005.otf
pyftmerge --verbose --drop-tables+=vmtx,vhea GoNotoContemporary.ttf GoNotoCJKCore2005.ttf
# ... 
File "/home/ubuntu/projects/go-noto-universal/venv_fonty/lib/python3.8/site-packages/fontTools/ttLib/tables/_c_m_a_p.py", line 172, in compile
    chunk = table.compile(ttFont)
  File "/home/ubuntu/projects/go-noto-universal/venv_fonty/lib/python3.8/site-packages/fontTools/ttLib/tables/_c_m_a_p.py", line 904, in compile
    header = struct.pack(cmap_format_4_format, self.format, length, self.language,
struct.error: 'H' format requires 0 <= number <= 65535

The actual length is about 66600, so just a little over 65K. The aim of this issue/ticket is to figure out a way to overcome the cmap limit.
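
To make the limit concrete, here is a small Python sketch of how a format 4 subtable's byte length adds up, and why packing that length as an unsigned 16-bit value fails. The helper names `format4_length` and `fits_in_uint16` are made up for illustration and are not part of fontTools. The size formula follows the format 4 layout: seven 16-bit header fields, a 16-bit reserved pad, four 16-bit arrays with one entry per segment, and the 16-bit glyphIdArray.

```python
import struct


def format4_length(seg_count: int, glyph_id_array_len: int = 0) -> int:
    """Byte length of a cmap format 4 subtable (hypothetical helper).

    14 bytes of header fields + 2 bytes of reservedPad, then endCode,
    startCode, idDelta and idRangeOffset (2 bytes per segment each),
    then 2 bytes per glyphIdArray entry.
    """
    return 16 + 8 * seg_count + 2 * glyph_id_array_len


def fits_in_uint16(value: int) -> bool:
    """True if struct can pack the value as an unsigned 16-bit integer."""
    try:
        struct.pack(">H", value)
        return True
    except struct.error:
        return False


print(format4_length(seg_count=1))  # smallest possible subtable
print(fits_in_uint16(65535))        # largest length that can be stored
print(fits_in_uint16(66600))        # approximate length reported above
```

Any subtable whose computed length exceeds 65535 cannot be written in format 4, which is what the traceback shows.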

What is strange is that the original, non-subsetted CJK font itself has a cmap length of about 40K, but the subsetted CJK font has a cmap length of 51K!

A brute-force workaround I found is to use --no-layout-closure while subsetting, but it also removes the locl feature, so JP or KR cannot use a CN font.

dscorbett commented 2 years ago

NotoSansCJKsc-Regular.otf’s 'cmap' table is shorter because it has fewer segments. A single continuous range of code points is more efficient for 'cmap' subtable format 4, but subsetting punches lots of little holes in the code point coverage.

$ spot -t cmap cached_fonts/NotoSansCJKsc-Regular.otf | grep segCountX2 | uniq
segCountX2   =1364
$ spot -t cmap GoNotoCJKCore2005.otf | grep segCountX2 | uniq
segCountX2   =10088
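
The effect of holes can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it counts runs of consecutive code points and ignores glyph ID ordering and the terminating 0xFFFF segment, both of which affect the real segment count. The helper names are hypothetical.

```python
def count_runs(codepoints):
    """Number of maximal runs of consecutive code points."""
    ordered = sorted(set(codepoints))
    return sum(
        1
        for i, cp in enumerate(ordered)
        if i == 0 or cp != ordered[i - 1] + 1
    )


def min_format4_bytes(seg_count_x2):
    """Lower bound on the subtable size from segCountX2 alone.

    16 fixed bytes plus 8 bytes per segment; glyphIdArray is not counted.
    """
    return 16 + 8 * (seg_count_x2 // 2)


full = range(0x4E00, 0x4E40)            # 64 contiguous code points
holed = [cp for cp in full if cp % 4]   # subsetting drops every fourth one

print(count_runs(full), count_runs(holed))
print(min_format4_bytes(1364), min_format4_bytes(10088))
```

With the segCountX2 values above, the segment arrays alone need at least 5472 bytes in the original font and 40368 bytes in the subsetted one.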

satbyy commented 2 years ago

Thanks, I am a font hobbyist, so I'm still finding my way :) Do you have any ideas on how to accomplish this task? From what you said, I think I'll try to reduce the number of "holes" in the range of code points covered.

dscorbett commented 2 years ago

That sounds like a good plan. You could find all the small ranges of ideographs not included in IICore and add support for them. The more gaps you fill, the smaller 'cmap' will be, but the bigger everything else will be. There’s a trade-off and it will require some experimentation to determine what counts as a small enough range.
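
One possible way to experiment with that trade-off is sketched below in Python. The function `fill_small_gaps`, its `max_gap` threshold, and the optional `available` set (the code points the source font actually maps) are all hypothetical names chosen for this illustration; nothing here is taken from the project's build scripts.

```python
def fill_small_gaps(codepoints, max_gap, available=None):
    """Return the code points plus every gap of at most max_gap missing ones.

    If `available` is given, a gap is filled only when the source font maps
    every code point in it; a partly filled gap would still split the range.
    """
    ordered = sorted(set(codepoints))
    result = set(ordered)
    for prev, cur in zip(ordered, ordered[1:]):
        missing = range(prev + 1, cur)
        if 0 < len(missing) <= max_gap:
            if available is None or all(cp in available for cp in missing):
                result.update(missing)
    return result


wanted = {0x4E00, 0x4E01, 0x4E03, 0x4E07, 0x4E20}
filled = fill_small_gaps(wanted, max_gap=3)

# The gaps of 1 and 3 code points are closed; the large gap before 0x4E20 stays.
print(sorted(hex(cp) for cp in filled))
```

Raising `max_gap` closes more holes, which shrinks 'cmap' but pulls in more glyphs, so the threshold has to be tuned by trial.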

satbyy commented 2 years ago

Thanks David. Your idea seems to have worked. I blindly added all codepoints in the range 0x4E00 - 0x5FFF to the IICore list. The subsetting worked, and cmap subtable format 4 ended up with a size of 63984, below the 65535 limit.

On the other hand, the number of glyphs increased by about 3000, but that's OK. The final font (with CJK + everything else) contains 57,474 glyphs, and pyftmerge succeeded. I will probably make a pull request tomorrow.

Thanks for your help!