cmap encoding selection: unicodeEncoding vs. microsoftUCS4Encoding

xianpingge commented 7 years ago

I'm dumping the glyphs from HanaMinB.ttf (available at https://osdn.net/frs/redir.php?m=pumath&f=%2Fhanazono-font%2F64385%2Fhanazono-20160201.zip), where most of the characters are > U+FFFF.

Attached is the output of ttfdump -t cmap HanaMinB.ttf.

According to the ttfdump output, this ttf file contains 4 cmap subtables, covering the 4 encodings defined in truetype.go:

    unicodeEncoding         = 0x00000003 // PID = 0 (Unicode), PSID = 3 (Unicode 2.0)
    microsoftSymbolEncoding = 0x00030000 // PID = 3 (Microsoft), PSID = 0 (Symbol)
    microsoftUCS2Encoding   = 0x00030001 // PID = 3 (Microsoft), PSID = 1 (UCS-2)
    microsoftUCS4Encoding   = 0x0003000a // PID = 3 (Microsoft), PSID = 10 (UCS-4)
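For readers unfamiliar with these constants: each value appears to pack the platform ID into the high 16 bits and the platform-specific encoding ID into the low 16 bits. A minimal sketch of that packing follows; the function name packPIDPSID is hypothetical and not part of truetype.go.

```go
package main

import "fmt"

// packPIDPSID combines a platform ID (PID) and a platform-specific
// encoding ID (PSID) into one 32-bit value: PID in the high 16 bits,
// PSID in the low 16 bits. Hypothetical helper, for illustration only.
func packPIDPSID(pid, psid uint16) uint32 {
	return uint32(pid)<<16 | uint32(psid)
}

func main() {
	fmt.Printf("0x%08x\n", packPIDPSID(0, 3))  // 0x00000003 (unicodeEncoding)
	fmt.Printf("0x%08x\n", packPIDPSID(3, 10)) // 0x0003000a (microsoftUCS4Encoding)
}
```

Shifting a packed value right by 16, as the selection code does with pidPsid>>16, recovers the platform ID.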

And the current code selects the first one (unicodeEncoding):

    pidPsid := u32(table, offset)
    // We prefer the Unicode cmap encoding. Failing to find that, we fall
    // back onto the Microsoft cmap encoding.
    if pidPsid == unicodeEncoding {
        bestOffset, bestPID, ok = offset, pidPsid>>16, true
        break
    } else if pidPsid == microsoftSymbolEncoding ||
        pidPsid == microsoftUCS2Encoding ||
        pidPsid == microsoftUCS4Encoding {

        bestOffset, bestPID, ok = offset, pidPsid>>16, true
        // We don't break out of the for loop, so that Unicode can override Microsoft.
    }

As a result, none of the characters above U+FFFF are available.

Should we prefer microsoftUCS4Encoding over the 16-bit-only unicodeEncoding?

HanaMinB.ttf-dump-cmap.txt

nigeltao commented 7 years ago

Yeah, we should probably prefer microsoftUCS4Encoding.

nigeltao commented 7 years ago

An alternative is to also accept PID = 0 (Unicode), PSID = 4 (Unicode 2.0, full repertoire, i.e. not restricted to the Basic Multilingual Plane). FWIW, ttx shows me 5 cmap subtables, not 4, for HanaMinB.ttf.

Also, the code as is prefers Unicode to Microsoft cmap encodings, but I can't remember the reason why, and maybe we don't need to. We should probably prefer cmap format 12 tables over cmap format 4, though, for the greater (non-BMP) range.
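One way the format-based preference could look is sketched below. This is a rough illustration and not the package's actual code: the names rank, bestSubtable and sampleCmap are hypothetical, the ranking order is an assumption, and the sample bytes are a synthetic cmap header (not taken from HanaMinB.ttf) with one format 4 and one format 12 subtable.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// rank scores a cmap subtable; higher is better and 0 means "skip".
// The ordering is an assumption for illustration, not the library's policy.
func rank(pid, psid, format uint16) int {
	unicode := pid == 0 && (psid == 3 || psid == 4)
	microsoft := pid == 3 && (psid == 0 || psid == 1 || psid == 10)
	if !unicode && !microsoft {
		return 0
	}
	switch format {
	case 12:
		return 2 // 32-bit character codes: can map code points above U+FFFF.
	case 4:
		return 1 // 16-bit character codes: Basic Multilingual Plane only.
	}
	return 0
}

// bestSubtable walks the encoding records in a cmap table header and
// returns the offset and format of the highest-ranked subtable. On a
// tie the earlier record wins.
func bestSubtable(cmap []byte) (offset uint32, format uint16, ok bool) {
	if len(cmap) < 4 {
		return 0, 0, false
	}
	n := int(binary.BigEndian.Uint16(cmap[2:]))
	best := 0
	for i := 0; i < n; i++ {
		rec := 4 + 8*i // each encoding record is 8 bytes
		if rec+8 > len(cmap) {
			break
		}
		pid := binary.BigEndian.Uint16(cmap[rec:])
		psid := binary.BigEndian.Uint16(cmap[rec+2:])
		off := binary.BigEndian.Uint32(cmap[rec+4:])
		if uint64(off)+2 > uint64(len(cmap)) {
			continue // subtable offset points outside the table
		}
		f := binary.BigEndian.Uint16(cmap[off:]) // subtable format field
		if r := rank(pid, psid, f); r > best {
			best, offset, format, ok = r, off, f, true
		}
	}
	return offset, format, ok
}

// sampleCmap returns a synthetic table: two encoding records followed
// by two truncated subtables that hold only their format fields.
func sampleCmap() []byte {
	return []byte{
		0, 0, 0, 2, // version 0, numTables 2
		0, 0, 0, 3, 0, 0, 0, 20, // PID 0, PSID 3, offset 20
		0, 3, 0, 10, 0, 0, 0, 22, // PID 3, PSID 10, offset 22
		0, 4, // subtable at offset 20: format 4
		0, 12, // subtable at offset 22: format 12
	}
}

func main() {
	off, format, ok := bestSubtable(sampleCmap())
	fmt.Println(off, format, ok) // 22 12 true
}
```

Ranking on the subtable format rather than on the PID/PSID pair alone means a format 12 subtable is picked whichever platform advertises it, and the strict comparison keeps the earlier record on a tie.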

golang / freetype

cmap encoding selection: unicodeEncoding vs. microsoftUCS4Encoding #44