Open xianpingge opened 7 years ago
Yeah, we should probably prefer microsoftUCS4Encoding.
An alternative is to also accept PID = 0 (Unicode), PSID = 4 (Unicode 2.0, full repertoire, i.e. not restricted to the Basic Multilingual Plane). FWIW, ttx shows me 5 cmap subtables, not 4, for HanaMinB.ttf.
Also, the code as is prefers Unicode to Microsoft cmap encodings, but I can't remember the reason why, and maybe we don't need to. We should probably prefer cmap format 12 tables over cmap format 4, though, for the greater (non-BMP) range.
I'm dumping the glyphs from HanaMinB.ttf (available at https://osdn.net/frs/redir.php?m=pumath&f=%2Fhanazono-font%2F64385%2Fhanazono-20160201.zip), where most of the characters are above U+FFFF.
Attached is the output of ttfdump -t cmap HanaMinB.ttf.
According to the ttfdump output, this ttf file contains 4 cmap subtables, covering the 4 encodings defined in truetype.go:
	unicodeEncoding         = 0x00000003 // PID = 0 (Unicode), PSID = 3 (Unicode 2.0)
	microsoftSymbolEncoding = 0x00030000 // PID = 3 (Microsoft), PSID = 0 (Symbol)
	microsoftUCS2Encoding   = 0x00030001 // PID = 3 (Microsoft), PSID = 1 (UCS-2)
	microsoftUCS4Encoding   = 0x0003000a // PID = 3 (Microsoft), PSID = 10 (UCS-4)
And the current code selects the first one (unicodeEncoding):
	pidPsid := u32(table, offset)
	// We prefer the Unicode cmap encoding. Failing to find that, we fall
	// back onto the Microsoft cmap encoding.
	if pidPsid == unicodeEncoding {
		bestOffset, bestPID, ok = offset, pidPsid>>16, true
		break
	} else if pidPsid == microsoftSymbolEncoding ||
		pidPsid == microsoftUCS2Encoding ||
		pidPsid == microsoftUCS4Encoding {

		bestOffset, bestPID, ok = offset, pidPsid>>16, true
		// We don't break out of the for loop, so that Unicode can override Microsoft.
	}
As a result, none of the characters above U+FFFF are available.
Should we prefer microsoftUCS4Encoding to the 16-bit-only unicodeEncoding?
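To make the question concrete, here is a rough sketch of one way the selection could rank encodings instead of breaking on the first Unicode match. It is not the current code and not a proposed patch: the names unicodeFullEncoding, encodingRank and pickEncoding are hypothetical, the ranks are only one possible ordering, and the sketch looks only at the PID/PSID pair, not at the subtable format (4 vs 12).

```go
package main

import "fmt"

const (
	unicodeEncoding         = 0x00000003 // PID = 0 (Unicode), PSID = 3 (Unicode 2.0)
	unicodeFullEncoding     = 0x00000004 // PID = 0 (Unicode), PSID = 4; hypothetical name
	microsoftSymbolEncoding = 0x00030000 // PID = 3 (Microsoft), PSID = 0 (Symbol)
	microsoftUCS2Encoding   = 0x00030001 // PID = 3 (Microsoft), PSID = 1 (UCS-2)
	microsoftUCS4Encoding   = 0x0003000a // PID = 3 (Microsoft), PSID = 10 (UCS-4)
)

// encodingRank returns a preference score for a packed PID/PSID value.
// Higher is better; 0 means the encoding is not recognized.
func encodingRank(pidPsid uint32) int {
	switch pidPsid {
	case microsoftUCS4Encoding, unicodeFullEncoding:
		return 3 // can reach characters above U+FFFF
	case unicodeEncoding:
		return 2 // keeps the existing Unicode-over-Microsoft preference
	case microsoftUCS2Encoding, microsoftSymbolEncoding:
		return 1
	}
	return 0
}

// pickEncoding returns the index of the highest-ranked record, or -1 if
// none is recognized. On a tie, the earliest record wins.
func pickEncoding(records []uint32) int {
	best, bestRank := -1, 0
	for i, r := range records {
		if rank := encodingRank(r); rank > bestRank {
			best, bestRank = i, rank
		}
	}
	return best
}

func main() {
	// The four encodings listed above, in the same order.
	records := []uint32{
		unicodeEncoding,
		microsoftSymbolEncoding,
		microsoftUCS2Encoding,
		microsoftUCS4Encoding,
	}
	fmt.Println(pickEncoding(records)) // prints "3"
}
```

With this ordering, a font that only carries the 16-bit encodings would still resolve to unicodeEncoding, as it does today.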
HanaMinB.ttf-dump-cmap.txt