google / fonts

Font files available from Google Fonts, and a public issue tracker for all things Google Fonts
https://fonts.google.com

Missing characters for Hong Kong in Google Font's Noto Sans TC subset #396

Open hfhchan opened 7 years ago

hfhchan commented 7 years ago

There are several commonly used characters which are missing from Google Fonts' Noto Sans TC subset.

Examples include:

- 奬 (U+596C), preferred form in 常用字字形表 versions 2000 and earlier, prevalence rate of 0.0077‰
- 戥 (U+6225), verb meaning to feel for someone, prevalence rate of 0.0061‰
- 擸 (U+64F8), verb meaning to take or to grab, with a prevalence rate of 0.0039‰
- 捭 (U+636D), used in the phrase "捭闔縱橫", with a prevalence rate of 0.0039‰
- 啩 (U+5569), sentence-final particle to indicate doubt, with a prevalence rate of 0.0038‰
- 説 (U+8AAC), preferred form according to 常用字字形表 (all versions), with a prevalence rate of 0.0036‰
- 舁 (U+8201), (formal) verb meaning to move (a dead body, a coffin), with a prevalence rate of 0.0033‰
- 仵 (U+4EF5), (formal) "仵工" person in charge of removing dead bodies, with a prevalence rate of 0.0029‰
- 揈 (U+63C8), verb meaning to hang out, to swing, with a prevalence rate of 0.0025‰
- 脗 (U+8117), (formal) verb meaning match (as in forensics, the DNA collected at the scene matched the suspect), with a prevalence rate of 0.0025‰
- 𠝹 (U+20779), verb meaning to cut, with a prevalence rate of 0.0016‰
- 瀡 (U+7021), verb meaning to slide down, with a prevalence rate of 0.0016‰
- 趷 (U+8DB7), verb meaning popping/poking out of an even surface; in the phrase "趷籮" which refers to having thrust the hips outward ( --> twerking), with a prevalence rate of 0.0014‰
- 掹 (U+63B9), verb meaning to pull or yank, 0.0014‰
- 孻 (U+5B7B), noun, meaning younger family member, 0.0014‰
- 釿 (U+91FF), noun, axe, used in names, 0.0012‰
- 麖 (U+9E96), noun, a kind of deer frequently found in Hong Kong, 0.0011‰
- 睩 (U+7769), verb, used in phrase "眼仔睩睩" (cute baby eyes), 0.0010‰
- 𠻹 (U+20EF9), sentence-final particle to indicate exclamation, 0.0009‰
- 𨋢 (U+282E2), noun, lift / elevator, 0.0009‰
- 骹 (U+9AB9), noun, joint (as in shoulder joint), 0.0009‰
- 摷 (U+6477), verb, rummage, 0.0008‰
- 忟 (U+5FDF), verb, angry, 0.0008‰
- 鈪 (U+922A), noun, bracelet, 0.0008‰
- 舸 (U+8238), noun, big ship, used in names, 0.0008‰
- 鯭 (U+9BED), noun, fish, or in the phrase "泥鯭的" (taxi-sharing service), 0.0008‰
- 搲 (U+6432), verb, to scratch an itch, 0.0008‰
- 咇 (U+5487), verb, to be squeezed out of, 0.0006‰
- 飊 (U+98CA), (formal) verb, fast, "飊升" (exponentially increase), 0.0006‰
- 㩒 (U+3A52), verb, to press, 0.0005‰
- 裇 (U+88C7), noun, shirt, 0.0005‰
- 湋 (U+6E4B), noun, used in names, 0.0005‰
- 鈈 (U+9208), noun, plutonium, as in the nuclear radioactive stuff, 0.0005‰
- 𢭃 (U+22B43), verb, to play (with children), to touch one's chin, to flip something upwards, 0.0005‰
- 彅 (U+5F45), noun, used in names, 0.0005‰
- 脱 (U+8131), preferred form of 脫 according to 常用字字形表 (all versions), with a prevalence rate of 0.0005‰
- 郜 (U+90DC), noun, place, used in names, 0.0005‰
- 蹚 (U+8E5A), verb, to tread into/over, 0.0004‰
- 𥄫 (U+2512B), verb, to stare, 0.0004‰
- 泆 (U+6CC6), noun, used in names, 0.0004‰
- 税 (U+7A0E), preferred form according to 常用字字形表 (all versions), with a prevalence rate of 0.0004‰
- 噃 (U+5643), sentence-final particle to indicate evidence or contrast, with a prevalence rate of 0.0004‰
- 溋 (U+6E8B), used in names, 0.0004‰
- 鏸 (U+93F8), used in names, 0.0004‰

Prevalence rates are based on a character survey conducted on a one-year archive of on.cc, one of the region's two most popular newspapers, covering 2015/03/21 to 2016/03/20 across the 港聞, 國際新聞, 香港評論, 香港娛樂, 香港財經 and 香港生活 sections.

For comparison, the character 筷, meaning chopstick, has a prevalence rate of 0.0025‰, so many of the listed characters are at least as common as everyday vocabulary. 跆, as in taekwondo, has a prevalence rate of 0.0009‰.
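The post does not say how the per-mille figures were computed. A minimal sketch of one plausible method, assuming the rate is a character's occurrences per thousand Han characters in the corpus (the denominator, the `is_han` ranges and the function names are assumptions, not taken from the survey):

```python
from collections import Counter


def is_han(ch: str) -> bool:
    """Rough test for CJK ideographs: URO, Extension A, and planes 2 and 3."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF
        or 0x3400 <= cp <= 0x4DBF
        or 0x20000 <= cp <= 0x3FFFF
    )


def prevalence_per_mille(texts):
    """Map each Han character to its occurrences per 1,000 Han characters."""
    counts = Counter(ch for text in texts for ch in text if is_han(ch))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: 1000 * n / total for ch, n in counts.items()}


# Toy corpus for illustration only; a real survey would read article text.
rates = prevalence_per_mille(["香港𨋢", "香港, ok"])
```

Iterating over a Python `str` yields whole code points, so Extension B characters such as 𨋢 are counted as single characters rather than as surrogate pairs.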

Prevalence rate lower than 0.0004‰ but still worth a mention:

- 閲 (U+95B2), preferred form in 常用字字形表 (all versions) for 閱, pending horizontal extension to UCS in HKSCS-2016
- 吿 (U+543F), preferred form in 常用字字形表 (all versions) for 告, pending horizontal extension to UCS in HKSCS-2016
- 兑 (U+5151), preferred form in 常用字字形表 (all versions) for 兌, pending horizontal extension to UCS in HKSCS-2016
- 藴 (U+85F4), preferred form in 常用字字形表 (all versions) for 蘊, pending horizontal extension to UCS in HKSCS-2016
- 醖 (U+9196), preferred form in 常用字字形表 (all versions) for 醞, pending horizontal extension to UCS in HKSCS-2016
- 鋭 (U+92ED), preferred form in 常用字字形表 (all versions) for 銳, pending horizontal extension to UCS in HKSCS-2016
- 丢 (U+4E22), variant form in 常用字字形表 (all versions) for 丟

- 𦉘 (H-9DBC), noun, generic term for big cooking pot
- 椗 (U+6917), noun, generic term for stalk of a fruit, "慈姑椗" meaning penis, metaphor for males
- 𦧲 (U+269F2), verb, to spit, to beg, to pester
- 𤓓 (U+244D3), adjective, generic term for a burning smell
- 鰵 (U+9C35), 金錢鰵 a fish prized for its fishmaw
- 啅 (U+5545), 啅頭 gimmick, 翻啅 to eat something again
- 襇 (U+8947), 褶襇 (formal) pleat
- 拃 (U+62C3), verb, to grab, 掗拃, bulky
- 啹 (U+5579), 啹喀, Gurkhas, a racial minority in Hong Kong, many of whom served in the army in WWII
- 𦟌 (U+267CC), shank, shin (of beef or pork)
- 抦 (U+62A6), to hit, to pound
- 揦 (U+63E6), to grab, to squish the face in dislike
- 飇 (U+98C7), variant form of 飊
- 嗮 (U+55EE), adverb meaning to full extent, completely

- 膦 (U+81A6), phosphonomethyl, 草甘膦, glyphosate, a weed-killer
- 酞 (U+915E), phthaleins, 酞菁 phthalocyanine
- 肼 (U+80BC), hydrazine, 米曲肼 (meldonium), banned drug in sports
- 氘 (U+6C18), deuterium (interestingly, 氚 for tritium is included)

- 腼 (U+817C), (formal) 腼腆, shy
- 飱 (U+98F1), (formal) used in proverb "誰知盤中飧 粒粒皆辛苦"
- 犂 (U+7282), a kind of farming tool, phrase "犂庭掃穴", the codename of a police operation
- 掮 (U+63AE), used in "披掮", noun, shawl
- 鱇 (U+9C47), used in 鮟鱇, anglerfish
- 璈 (U+7488), used in 瑧璈, 璈珀, names of two expensive residential blocks in Hong Kong
- 瑆 (U+7446), used in 瑆華, name of a residential block in Hong Kong

- 駹 (U+99F9), a type of horse, used in names
- 䮎 (U+4B8E), a type of horse, used in names
- 琋 (U+740B), a type of jade, used in names
- 睎 (U+774E), used in names
- 㷧 (U+3DE7), used in names, name of ex-chairperson of ESPRIT, a publicly listed company
- 廼 (U+5EFC), used in names, "甘廼迪" for John F. Kennedy
- 錤 (U+9324), used in names


Noto Sans TC has been subsetted to the most frequent 7,800 Chinese characters in Traditional Chinese documents. 223 characters are added to cover all the characters in Taiwan's CNS 11643 P1 and 常用國字標準字體表 as well as Hong Kong's 常用字字形表 and IRG HB0 and HB1. In addition to Hanzi, Bopomofo, CJK Radicals, ASCII, punctuation marks and full-width characters are included. The full version can be downloaded in the link below. For more details, see Noto CJK Help.

Hong Kong's 常用字字形表 covers frequently used characters for primary schools, which are mainly characters used in formal writing (or Mandarin Chinese). It doesn't cover frequently used characters for Cantonese. HB0 / HB1 are Big-5 Sections A and B, which are basically the same set as CNS 11643 P1 / P2.

Take note that Big-5 and CNS 11643 P1/P2 only cover characters based on their usage in Mandarin Chinese, and the character set was based on data collected from Taiwanese newspapers in 1976. This set is nowhere near representative of what people in Taiwan and Hong Kong use in the modern context.


Some characters whose actual use is low and whose official status is discouraged or non-canonical can be removed:

- 银 (U+94F6), simplified Chinese form for 銀
- 装 (U+88C5), simplified Chinese form for 裝
- 优 (U+4F18), simplified Chinese form for 優. Also a Traditional Chinese character meaning white, but such use is archaic.
- 样 (U+6837), simplified Chinese form for 樣. Also a Traditional Chinese character meaning a column used for hanging silkworms, but such use is archaic.
- 痒 (U+75D2), simplified Chinese form for 癢, or variant form for 瘍, but such use is archaic.
- 随 (U+968F), simplified Chinese form for 隨
- 艺 (U+827A), simplified Chinese form for 藝
- 动 (U+52A8), simplified Chinese form for 動
- 栄 (U+6804), Japanese form for 榮
- 众 (U+4F17), simplified Chinese form for 眾
- 庆 (U+5E86), simplified Chinese form for 慶
- 亚 (U+4E9A), simplified Chinese form for 亞
- 总 (U+603B), simplified Chinese form for 總
- 萦 (U+8426), simplified Chinese form for 縈
- 礼 (U+793C), simplified Chinese form for 禮, or archaic form for 禮.
- 渊 (U+6E0A), simplified Chinese / Japanese form for 淵.
- 稲 (U+7A32), Japanese form for 稻
- 総 (U+7DCF), Japanese form for 總
- 聡 (U+8061), Japanese form for 聰
- 錬 (U+932C), Japanese form for 鍊
- 齢 (U+9F62), Japanese form for 齡

- 睌 (U+774C), used in "睌䁂", archaic.
- 亁 (U+4E81), variant of 乾, uncommon.
- 廾 (U+5EFE), archaic, rare.

For reasons unknown, the subsetting seems to have included most of the simplified Chinese characters encoded inside HKSCS (more specifically, those in IICORE and HKSCS at the same time, regardless of existence of "H" flag in IICORE), but actually failed to include commonly used characters in Hong Kong.


originally posted in googlei18n/noto-cjk#77

davelab6 commented 7 years ago

@hfhchan could you please confirm whether these characters are also missing from the fonts at www.google.com/get/noto? If they are present there, I can update the fonts available from fonts.google.com/earlyaccess; if they are missing there too, the original issue post is where this can be resolved :)

hfhchan commented 7 years ago

They all exist. Tested with Noto Sans TC Black.

Interestingly, the 14 characters quoted by @kenlunde in https://github.com/googlei18n/noto-cjk/issues/77#issuecomment-251847570 seem to exist also. Is it that Noto Sans TC is built with the SHS(CN) glyphs as well?

davelab6 commented 7 years ago

OK, so this issue can be resolved by updating the fonts in early access?

hfhchan commented 7 years ago

Yes.

kenlunde commented 7 years ago

@hfhchan: Glyphs for the 14 characters that I pointed out in the referenced issue are not present in the Noto Sans CJK region-specific subset fonts, and if you are seeing glyphs for them, it is the result of font fallback.

hfhchan commented 7 years ago

𡃶 (U+210F6), Cantonese for kissing, is also missing.

- Use on Facebook: https://www.google.com.hk/search?q=%F0%A1%83%B6%20site%3Afacebook.com
- Use on Apple Daily, one of the two most popular printed newspapers in circulation in Hong Kong: https://www.google.com.hk/search?q=site:hk.apple.nextmedia.com 𡃶
- Use on TVMost, a very popular media site in Hong Kong, especially among youth: https://www.google.com.hk/search?q=site:www.tvmost.com.hk/ 𡃶

錫 or o錫 is often used as a substitute when the Extension B character cannot be typed.

kenlunde commented 7 years ago

𡃶 U+210F6 is in the official Traditional Chinese subset, so some other process is stripping it out.

hfhchan commented 7 years ago

None of those I have mentioned are in Google's subset (https://fonts.google.com/earlyaccess).

First, I included the CSS file in a webpage, with Adobe Blank as the fallback font:

<html><head><link href="http://fonts.googleapis.com/earlyaccess/notosanstc.css" rel="stylesheet"></head><body style="
    font: 200px Noto Sans TC, Adobe Blank;
">玄磁𡃶石清你好奬戥擸捭啩説舁仵揈脗
𠝹瀡趷掹孻釿麖睩𠻹𨋢骹摷忟鈪舸鯭搲咇飊㩒裇湋鈈𢭃彅脱郜蹚𥄫泆税噃溋鏸閲兑藴醖鋭丢𦉘椗𦧲𤓓鰵啅襇拃啹𦟌抦揦飇嗮膦酞肼氘腼飱犂掮鱇璈瑆駹䮎琋睎㷧廼錤原
</body></html>

Yields the following result on Google Chrome:

[image: screenshot of the test page]


Second, I tested using Noto Sans TC Black downloaded here... [image]

and got this in Microsoft Word 2013: [image]
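The same coverage check can be run without a browser or font fallback getting in the way, by reading a font file's character map directly. A minimal sketch, assuming the third-party fontTools library is installed and that `font_path` points to a locally downloaded font file (both are assumptions; the function names are hypothetical and not part of the test described above):

```python
def missing_characters(text, supported_codepoints):
    """Return the characters of `text` whose code points are not in
    `supported_codepoints`, in first-seen order and without duplicates."""
    seen = set()
    missing = []
    for ch in text:
        if ch.isspace() or ch in seen:
            continue
        seen.add(ch)
        if ord(ch) not in supported_codepoints:
            missing.append(ch)
    return missing


def load_cmap_codepoints(font_path):
    """Read the set of code points mapped by a font's best Unicode cmap.

    Requires fontTools (pip install fonttools); imported lazily so that
    missing_characters() stays usable without it.
    """
    from fontTools.ttLib import TTFont

    font = TTFont(font_path)
    cmap = font.getBestCmap() or {}  # None when no Unicode cmap exists
    return set(cmap)


# Illustration with a hand-written code point set instead of a real font.
demo_supported = {ord("你"), ord("好")}
demo_missing = missing_characters("你好𡃶你", demo_supported)
```

With a real file, `missing_characters(sample_text, load_cmap_codepoints(font_path))` would list the characters the font itself cannot render, independent of any fallback the browser or Word applies.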

hfhchan commented 7 years ago

Another one is 𡃴 (U+210F4), for smell/stink. The problem with 𡃴 is that most text using it is encoded with Private Use Area (PUA) code points, which makes it very hard to index.

An example appears here, in one of the city's prestigious (but less popular nowadays) newspapers: http://news.mingpao.com/pns/a/web_tc/article/20150923/s00005/1442945448331 which says 「香港 (U+F457)」 when it should be encoded as 「香港𡃴 (U+210F4)」.

This example wasn't found in Google's index. It was found only because someone copied the article and posted it somewhere else while correcting the PUA characters.

Another example, where it is encoded correctly in the SIP (Supplementary Ideographic Plane), is in one of the city's two most popular printed media: http://hk.apple.nextmedia.com/supplement/culture/art/20160729/19713609 「聞過女人𡃴心思思有件事呀?」
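PUA-encoded text like the first example can be flagged mechanically, since Private Use code points carry the Unicode general category `Co`. A minimal sketch using Python's standard `unicodedata` module (the function name is hypothetical; mapping a PUA code point back to its intended character still needs a vendor-specific table, which is not shown here):

```python
import unicodedata


def find_private_use(text):
    """Return (index, 'U+XXXX') for every Private Use character in `text`."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]


# The PUA-encoded string from the first example versus the correct encoding.
pua_hits = find_private_use("香港\uf457")
sip_hits = find_private_use("香港𡃴")
```

The first call reports the U+F457 character at index 2; the second finds nothing, because U+210F4 is an ordinary ideograph in the SIP.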

kenlunde commented 7 years ago

The main thing to understand is that if a character is in HKSCS-2008 (or its Big Five subset), it is included in the official Traditional Chinese subset of Noto Sans CJK, though the glyph may not completely conform to HK guidelines (that's a Version 2.000 thingie). I cannot speak to the apparent further subsetting that is referenced in this issue.

hfhchan commented 5 years ago

@davelab6 Now that Source Han Sans HK is released, would there be any update to this issue? The aforementioned characters have been present in Source Han Sans/Noto Sans TC since the very beginning, but are missing from the Noto Sans TC subset served via Google Fonts. These characters are now isolated to the Source Han Sans HK subset and removed from the Taiwan subset; would the splitting of TW also be carried out for Noto?

The continued lack of Cantonese characters makes Noto Sans TC served via Google Fonts pretty useless for Hong Kong users, who constitute the second-largest group of Traditional Chinese users. A large media outlet in Hong Kong, HK01, had used the Noto Sans TC webfont (medium weight) served from Google Fonts but has since dropped it, because many characters would fall back to the system default and render in a light serif font.

tamcy commented 4 years ago

It seems that all of the characters, with 駹 being the only exception, are now included in the latest version of the Noto Sans HK webfont.

 ----------- ----------- -------------
  Codepoint   Character   Subset file
 ----------- ----------- -------------
  U+596C      奬          76   
  U+6225      戥          70   
  U+64F8      擸          68   
  U+636D      捭          69   
  U+5569      啩          79   
  U+8AAC      説          101  
  U+8201      舁          46   
  U+4EF5      仵          84   
  U+63C8      揈          69   
  U+8117      脗          46   
  U+20779     𠝹          9    
  U+7021      瀡          59   
  U+8DB7      趷          38   
  U+63B9      掹          69   
  U+5B7B      孻          74   
  U+91FF      釿          35   
  U+9E96      麖          26   
  U+7769      睩          54   
  U+20EF9     𠻹          9    
  U+282E2     𨋢          1    
  U+9AB9      骹          28   
  U+6477      摷          68   
  U+5FDF      忟          72   
  U+922A      鈪          35   
  U+8238      舸          46   
  U+9BED      鯭          27   
  U+6432      搲          68   
  U+5487      咇          80   
  U+98CA      飊          30   
  U+3A52      㩒          87   
  U+88C7      裇          41   
  U+6E4B      湋          61   
  U+9208      鈈          35   
  U+22B43     𢭃          7    
  U+5F45      彅          72   
  U+8131      脱          46   
  U+90DC      郜          36   
  U+8E5A      蹚          37   
  U+2512B     𥄫          4    
  U+6CC6      泆          62   
  U+7A0E      税          52   
  U+5643      噃          78   
  U+6E8B      溋          60   
  U+93F8      鏸          33   
  U+95B2      閲          32   
  U+543F      吿          80   
  U+5151      兑          82   
  U+85F4      藴          43   
  U+9196      醖          35   
  U+92ED      鋭          34   
  U+4E22      丢          84   
  U+26258     𦉘          3    
  U+6917      椗          65   
  U+269F2     𦧲          3    
  U+244D3     𤓓          5    
  U+9C35      鰵          27   
  U+5545      啅          79   
  U+8947      襇          41   
  U+62C3      拃          69   
  U+5579      啹          79   
  U+267CC     𦟌          3    
  U+62A6      抦          69   
  U+63E6      揦          69   
  U+98C7      飇          30   
  U+55EE      嗮          79   
  U+81A6      膦          46   
  U+915E      酞          35   
  U+80BC      肼          47   
  U+6C18      氘          63   
  U+817C      腼          46   
  U+98F1      飱          29   
  U+7282      犂          57   
  U+63AE      掮          69   
  U+9C47      鱇          27   
  U+7488      璈          56   
  U+7446      瑆          56   
  U+4B8E      䮎          84   
  U+740B      琋          56   
  U+774E      睎          54   
  U+3DE7      㷧          86   
  U+5EFC      廼          72   
  U+9324      錤          34   
  U+210F4     𡃴          8    
  U+210F6     𡃶          8    
 ----------- ----------- -------------

BTW, the latest webfont is currently still based on Noto Sans CJK v2.000. It would be better if it were updated to v2.001.
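A lookup like the table above can be scripted from the webfont stylesheet. A minimal sketch, assuming the served CSS consists of `@font-face` blocks that each carry a `unicode-range` descriptor and are preceded by a numbered comment such as `/* [0] */` (the layout, the sample CSS, its font name and its ranges are all invented for illustration and do not reproduce the real subset data):

```python
import re

# Numbered comment followed by one @font-face block (assumed layout).
FACE_RE = re.compile(r"/\*\s*\[(\d+)\]\s*\*/\s*@font-face\s*\{(.*?)\}", re.S)
RANGE_RE = re.compile(r"unicode-range\s*:\s*([^;]+)")


def parse_unicode_range(value):
    """Turn 'U+5960-596F, U+5B7B, U+2110?' into (start, end) pairs."""
    ranges = []
    for part in value.split(","):
        body = part.strip()[2:]  # drop the leading 'U+'
        if not body:
            continue
        if "-" in body:
            lo, hi = body.split("-")
            ranges.append((int(lo, 16), int(hi, 16)))
        elif "?" in body:  # wildcard digits: U+2110? covers 21100-2110F
            ranges.append(
                (int(body.replace("?", "0"), 16), int(body.replace("?", "F"), 16))
            )
        else:
            cp = int(body, 16)
            ranges.append((cp, cp))
    return ranges


def slice_for(css, char):
    """Return the number of the first block whose range covers `char`, else None."""
    cp = ord(char)
    for face in FACE_RE.finditer(css):
        found = RANGE_RE.search(face.group(2))
        if found and any(lo <= cp <= hi for lo, hi in parse_unicode_range(found.group(1))):
            return int(face.group(1))
    return None


# Invented sample; real stylesheets have many more blocks and ranges.
SAMPLE_CSS = """
/* [0] */
@font-face {
  font-family: 'Example Sans';
  src: url(example.0.woff2) format('woff2');
  unicode-range: U+210F4, U+210F6, U+2110?;
}
/* [1] */
@font-face {
  font-family: 'Example Sans';
  src: url(example.1.woff2) format('woff2');
  unicode-range: U+5960-596F, U+5B7B;
}
"""
```

A character for which `slice_for` returns `None` is not covered by any block, which is how an omission such as 駹 would show up.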

DragonMeme commented 2 years ago

爲 (U+7232) is also missing from both the TC and HK subsets. It is a commonly used character. Checked in the preview text as of the time of this message. Screenshot below: [image]

davelab6 commented 1 month ago

Does https://fonts.google.com/noto/specimen/Noto+Sans+HK fix this?