notofonts / noto-fonts

Noto fonts, except for CJK and emoji
http://fonts.google.com/noto
SIL Open Font License 1.1

Script-Language linkage data. #485

Closed c933103 closed 8 years ago

c933103 commented 9 years ago

The page at http://www.google.com/get/noto/ states that the language data are taken from the Unicode CLDR repository. But what about languages that are not included in CLDR? The page says the following languages use Noto Sans CJK SC:

- Cantonese
- Gan Chinese
- Literary Chinese (Simplified script)
- Min Nan Chinese
- Simplified Chinese
- Wu Chinese
- Xiang Chinese
- Zhuang (Simplified script)

while the following languages use Noto Sans CJK TC:

- Hakka Chinese
- Literary Chinese
- Traditional Chinese

Yet I believe the CLDR data only include Traditional Chinese and Simplified Chinese.

The reason I am asking is that some of the data appears to be problematic. While some people do write Cantonese and Min Nan Chinese in Simplified Han script, I more commonly see those languages written in Traditional Han script, and Min Nan Chinese is sometimes also written in the Latin alphabet via romanization. And while it is not uncommon to see Hakka Chinese in Traditional Han script, some people write it in Simplified Han script too. I am not knowledgeable enough about Wu/Xiang/Zhuang/Gan to say whether they are written in Simplified script only (or Simplified plus Latin only), but you might want to check whether they are also written in Traditional script.

And for non-Han languages, according to Wikipedia, the Ainu language also uses Latin script, and the Vietnamese language also uses Han script.

dougfelt commented 9 years ago

The data is CLDR plus anticipated changes for future releases plus internal tweaks, sometimes made for undocumented reasons. We grab all the supplemental data and use various parts of it to try to figure out what language-script combinations have had significant use (regardless of whether they are currently in use).

When we next update the website it should list Cantonese, Min Nan, and Hakka in both Simplified and Traditional Han.

Anything can be written in romanization, so we tread lightly there. I assume most people who want to write a language in romanization can find Noto Sans/Serif ok. If Min Nan is currently primarily written/read in Latin romanization, then I can add it. If it is primarily written in mixed roman/Han then I probably wouldn't list Sans/Serif unless Min requires accents or marks that Noto Sans CJK doesn't have.

c933103 commented 9 years ago

Both the Hakka Wikipedia and the Min Nan Wikipedia are biscriptal (that is, they offer both a Hanzi edition and a romanized edition). The Hakka one has implemented an automatic conversion system, while the Min Nan Wikipedia currently hosts only romanized content and has moved all Hanzi content to Wikia. I don't know whether that counts as "primarily" or not. (Or maybe you can check hak.wikipedia.org or zh-min-nan.wikipedia.org to see whether the CJK font is enough.)

dougfelt commented 9 years ago

Ok, I'll follow wikipedia's lead on this. I'm going to assume that Noto Sans has all the characters needed for Hakka and Min Nan Chinese in Latin, and list Latn as a script for both languages.

@roozbehp do you have any comments? Currently extra_locale_data has Hans and CN as the default script and region for nan. I'd probably change the default script for nan to Latn, based on wikipedia's apparent preference for it.

It's not clear to me that Noto Sans CJK has all the required characters for writing these languages in Hani. @kenlunde, do you know?

kenlunde commented 9 years ago

The Latin glyphs in Noto Sans CJK are specifically designed to cover ASCII (the first half of U+00xx), ISO Latin 1 (the second half of U+00xx), the characters required for common CJK transliteration/transcription, and Vietnamese. I am not familiar with what Latin characters are used for Hakka Chinese and Min Nan Chinese. In terms of the ideographs, there is complete coverage of the URO and Extension A in both Noto Sans CJK SC and Noto Sans CN.

c933103 commented 8 years ago

About the use of POJ (Latin) or Hanji (Han ideographs) in Min Nan Chinese, see their village pump: https://zh-min-nan.wikipedia.org/wiki/Wikipedia:Chhi%C5%AB-%C3%A1-kha#.E7.82.BA.E4.BB.80.E9.BA.BC.E6.BC.A2.E5.AD.97.E6.A2.9D.E7.9B.AE.E8.A6.81.E6.94.BE.E5.9C.A8.E8.A8.8E.E8.AB.96.E9.A0.81.EF.BC.9F It seems to be a widely debated topic: multiple votes have already been initiated to change the writing style used in that Wikipedia, but none of them gathered enough consensus for a change, so the topic apparently divides their community deeply. And from what I read on the page, while the major dialect of Min Nan Chinese can be written in both scripts, some dialects are apparently Han-only, at least in users' daily life.

According to Wikipedia, Min Nan Chinese's POJ uses three special characters: "O͘", "o͘", and "ⁿ". (The first two are a regular "O"/"o" plus U+0358, and "ⁿ" is U+207F.) It also uses five different marks for different tones, á à â ǎ ā (there are other tones, but they use marks on more regular characters or no marks at all), and these marks can appear on a/o/o͘/e/i/u.

Hakka Chinese POJ also uses the special character Ṳ, and uses â á à á å to represent tones.

On the other hand, the Min Dong Chinese Wikipedia says (https://cdo.wikipedia.org/wiki/%E5%B9%AB%E5%8A%A9:Ci%C5%8Fng-i%C3%B4ng_t%C4%95%CC%A4k) that special fonts are required to display its content correctly, as the romanized edition uses U+0324 and U+1E73. Min Dong Chinese is still not supported by Noto, but have these two characters become part of Noto Sans CJK?
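
To make the code points in the comment above concrete, here is a minimal sketch using only Python's standard `unicodedata` module. The list name `MENTIONED` and the helper `describe` are made up for illustration. The sketch names each code point and shows that "o" plus U+0358 stays a two-code-point sequence under NFC, which is why a font has to position the mark itself.

```python
import unicodedata

# Code points named in the comment above (list name is hypothetical).
MENTIONED = ["\u0358", "\u207f", "\u0324", "\u1e73"]


def describe(ch):
    """Return 'U+XXXX NAME' for a single code point."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in MENTIONED:
    print(describe(ch))

# "o" + U+0358 has no precomposed form, so NFC keeps both code points.
o_dot = unicodedata.normalize("NFC", "o\u0358")
print(len(o_dot))

# U+1E73 canonically decomposes to "u" + U+0324.
print(unicodedata.normalize("NFD", "\u1e73") == "u\u0324")
```

A font that lacks U+0358 or U+0324 cannot render these sequences on its own, whatever its coverage of the base letters.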

kenlunde commented 8 years ago

Noto Sans CJK does not support these special Latin forms.

dougfelt commented 8 years ago

These characters are, though, supported by NotoSans.

It's not clear that 'o' followed by U+0358 would always render properly if Sans and Sans CJK were both available-- the font fallback rules implemented by the rendering system might get the characters from different fonts, and a fallback ordering that worked correctly for some character sequences might fail to work properly for others. It also depends on whether Noto Sans CJK has special uses of the latin characters that would not be available if the latin came from a different font. Ken would know.

kenlunde commented 8 years ago

As long as Noto Sans is higher up in the font fallback chain, that is where the glyphs for these characters would come from.
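
The fallback concern can be illustrated with a toy model. This is only a sketch: the coverage sets below are invented for the example and are not the real cmaps of either font, and real rendering systems may choose fonts per cluster or per run rather than per code point. The names `COVERAGE` and `assign_fonts` are hypothetical.

```python
# Invented coverage sets, for illustration only (not the real cmaps).
COVERAGE = {
    "Noto Sans": set(map(ord, "Oo\u0358\u207f")),
    "Noto Sans CJK": set(map(ord, "Oo\u6f22\u5b57")),
}


def assign_fonts(text, chain):
    """Naive per-code-point fallback: pick the first font in `chain`
    whose coverage set contains the code point, else None."""
    result = []
    for ch in text:
        font = next((f for f in chain if ord(ch) in COVERAGE[f]), None)
        result.append((f"U+{ord(ch):04X}", font))
    return result


# CJK first: the base letter and the combining mark land in different fonts.
print(assign_fonts("o\u0358", ["Noto Sans CJK", "Noto Sans"]))

# Noto Sans first: both code points come from the same font.
print(assign_fonts("o\u0358", ["Noto Sans", "Noto Sans CJK"]))
```

Under this naive model, putting Noto Sans ahead of Noto Sans CJK keeps the base letter and U+0358 in one font, which matches the ordering suggested above.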

dougfelt commented 8 years ago

Ken, are you saying that there's no, for example, kerning or alternate forms of the latin glyphs that make them work better with the CJK glyphs and that would be lost if all of Latin came from another font? Seems unlikely there'd be extra support but I hesitated to make that claim myself.

If that's the case then using NotoSans and falling back to NotoCJK would be a viable approach to supporting mixed pages using both Latin and CJK for these languages.

kenlunde commented 8 years ago

For the characters in question, which are Latin proper, what comes from Noto Sans is probably more desirable than what is in Noto Sans CJK, which is derived from Source Sans Pro and lacks the broader coverage.

dougfelt commented 8 years ago

The web site now shows both Min Nan and Hakka as being written in any of Latin, Traditional Chinese, or Simplified Chinese. We also show Cantonese as being written in either Traditional Chinese or Simplified Chinese.

For Min Nan in Latin, Noto Sans or Noto Serif would be required as not all the required Latin characters are supported by the CJK fonts.

Our estimate of whether a language can be written in one of our fonts is based primarily on the character repertoire we think the language requires, which is largely based on exemplar character data from CLDR. However, if a language requires special contextual forms, we might not know enough to provide the GSUB/GPOS support for those forms. Please let us know if you see these kinds of issues.
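
A repertoire-based check of this kind might look roughly like the sketch below. It is not the actual nototools code. The function name `missing_characters`, the exemplar list, and the coverage set are all assumptions made up for illustration. As noted above, a check like this says nothing about whether GSUB/GPOS shaping is adequate.

```python
def missing_characters(exemplars, font_cmap):
    """Return the sorted code points needed by `exemplars` (an iterable of
    strings, possibly multi-code-point) that are absent from `font_cmap`
    (a set of integer code points)."""
    needed = set()
    for item in exemplars:
        needed.update(ord(c) for c in item)
    return sorted(needed - font_cmap)


# Hypothetical exemplar data and font coverage, for illustration only.
exemplars = ["a", "o", "o\u0358", "\u207f"]
cjk_like_cmap = set(map(ord, "abcdefghijklmnopqrstuvwxyz"))

gaps = missing_characters(exemplars, cjk_like_cmap)
print([f"U+{cp:04X}" for cp in gaps])
```

An empty result would mean the font covers the repertoire. Any code point listed is a reason to require a different font for that language-script pair.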

c933103 commented 8 years ago

For https://github.com/googlei18n/nototools/commit/8db032faa337184e7a08392f75cb10dd9d4001dc#diff-fcf976ca6a91cdc1071b57df5054cae4

I would say L451, 452, and 455 should be

'CN': ['yue-Hans', 'hak-Hans', 'hak-Latn', 'nan-Hans', 'nan-Latn'],
'HK': ['yue-Hant'],
'TW': ['nan-Hant', 'nan-Latn', 'hak-Hant', 'hak-Latn'],

instead of how it is written currently.
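
As a quick sanity check on the proposed entries, the sketch below inverts the region mapping to list the scripts each language ends up with. The dictionary name `REGION_TO_LOCALES` and the helper `scripts_by_language` are hypothetical. Only the three entries themselves come from the proposal above.

```python
from collections import defaultdict

# The three entries proposed above; the variable name is hypothetical.
REGION_TO_LOCALES = {
    "CN": ["yue-Hans", "hak-Hans", "hak-Latn", "nan-Hans", "nan-Latn"],
    "HK": ["yue-Hant"],
    "TW": ["nan-Hant", "nan-Latn", "hak-Hant", "hak-Latn"],
}


def scripts_by_language(region_to_locales):
    """Collect, for each language code, the sorted list of scripts it is
    paired with across all regions."""
    scripts = defaultdict(set)
    for locales in region_to_locales.values():
        for locale in locales:
            lang, script = locale.split("-")
            scripts[lang].add(script)
    return {lang: sorted(found) for lang, found in scripts.items()}


print(scripts_by_language(REGION_TO_LOCALES))
```

With these entries, Min Nan and Hakka each map to Hans, Hant, and Latn, and Cantonese maps to Hans and Hant.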

Also, the translation of the language name on L371 of https://github.com/googlei18n/nototools/blob/da428598efc6c381b139aca989fdf0863ada7cb1/third_party/cldr/common/main/zh_Hant.xml for the language with node MRJ contains an error. Please also check the other lines in the file to see whether the same error appears anywhere else.

Also, searching for Central Okinawan on the site no longer returns any results?

On the other hand, Wikipedia says Nastaʿlīq script is used by users of some languages in China. https://en.wikipedia.org/wiki/Nasta%CA%BFl%C4%ABq_script

dougfelt commented 8 years ago

> 'CN': ['yue-Hans', 'hak-Hans', 'hak-Latn', 'nan-Hans', 'nan-Latn'],
> 'HK': ['yue-Hant'],
> 'TW': ['nan-Hant', 'nan-Latn', 'hak-Hant', 'hak-Latn'],

OK

> Also, the translation of the language name on L371 of https://github.com/googlei18n/nototools/blob/da428598efc6c381b139aca989fdf0863ada7cb1/third_party/cldr/common/main/zh_Hant.xml for the language with node MRJ contains an error. Please also check the other lines in the file to see whether the same error appears anywhere else.

I don't think this impacts the web site, since I don't think we pick up the Chinese name for mrj. I didn't see a replacement for the current Chinese name; did you mean to provide one? This is really a CLDR issue, so perhaps it would be better to report this to CLDR directly.

> Also, searching for Central Okinawan on the site no longer returns any results?

Thanks, the tool mistakenly stopped mapping Kana to Japanese, and dropped these.

> On the other hand, Wikipedia says Nastaʿlīq script is used by users of some languages in China. https://en.wikipedia.org/wiki/Nasta%CA%BFl%C4%ABq_script

@kmansourMT, does Noto Nastaliq support Uyghur? The above-referenced page asserts Uyghur is written using nasta'liq style, but I can't find confirmation of this.

kmansourMT commented 8 years ago

Doug, At the start of this project, we agreed on a character set that would be incremented in stages. In the attached document, you will see 3 levels of requirements: Tier1, Tier2, and Tier3. For the current stage, we had agreed to support only the needs of certain South Asian languages.

Kamal

c933103 commented 8 years ago

It should be 馬里 not 馬裹, and the error seems to be caused by automatic conversion from Hans to Hant. As the page is too large and my computer crashes every time I load it, I can't check whether there are other similar problems on the page, unless the data is also stored somewhere else in smaller segments. Also, how do I report this to CLDR directly?

davelab6 commented 8 years ago

@kmansourMT the attachment isn't included in GitHub issue emails. If you visit https://github.com/googlei18n/noto-fonts/issues/485 then you can edit your comment, and then drag and drop the file into the comment to attach it.

(ref https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/ )

dougfelt commented 8 years ago

> It should be 馬里 not 馬裹, and the error seems to be caused by automatic conversion from Hans to Hant. As the page is too large and my computer crashes every time I load it, I can't check whether there are other similar problems on the page, unless the data is also stored somewhere else in smaller segments. Also, how do I report this to CLDR directly?

It looks like people are supposed to use the CLDR survey tool. This is new data and it looks like there are differing opinions as to what the right translation is: http://st.unicode.org/cldr-apps/v#/zh_Hant/Languages_T_Z/

It looks like you need to create an account before proposing data changes. The main entry point is here: http://cldr.unicode.org/index/survey-tool

If your computer has trouble with the xml file you might have trouble with the survey tool, I don't know.