Hyphens are needed in languages that allow hyphenation

roozbehp commented 9 years ago

Hyphen characters (U+002D and U+2010) are needed in the fonts for scripts that use hyphenation. This would make it possible to design a hyphen shape that's most appropriate for the script, and potentially to apply kerning and such around the characters.

For example, both Armenian and Ethiopic use hyphens, but Noto Sans Armenian and Noto Sans Ethiopic don't have any hyphen characters.

In Android, we are temporarily applying a hack to copy the hyphen from the LGC fonts, but that's not sustainable. (See internal bug b/21570828 for more information.)

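    from nototools import coverage
    from nototools import fix_khmer_and_lao_coverage as merger

    # Script-specific fonts that currently lack hyphen characters.
    FONTS = [
        'NotoSansArmenian-Regular.ttf',
        'NotoSansArmenian-Bold.ttf',
        'NotoSerifArmenian-Regular.ttf',
        'NotoSerifArmenian-Bold.ttf',
        'NotoSansEthiopic-Regular.ttf',
        'NotoSansEthiopic-Bold.ttf',
    ]

    # HYPHEN-MINUS and HYPHEN.
    HYPHENS = {0x002D, 0x2010}

    for font_name in FONTS:
        # The matching LGC font has the same name without the script,
        # e.g. NotoSansArmenian-Regular.ttf -> NotoSans-Regular.ttf.
        lgc_font_name = (font_name.replace('Armenian', '')
                                  .replace('Ethiopic', ''))

        # Hyphens missing from the script font but present in the LGC font.
        chars_to_add = ((HYPHENS - coverage.character_set(font_name))
            & coverage.character_set(lgc_font_name))

        if chars_to_add:
            # Write the merged font under with-hyphen/.
            merger.merge_chars_from_bank(
                font_name,
                lgc_font_name,
                'with-hyphen/'+font_name,
                chars_to_add)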

Assigning to Doug for now to figure out the list of scripts that need hyphens. A good starting point is the list of automatically-hyphenated languages from http://tug.org/svn/texhyphen/trunk/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/.

/cc @raphlinus @brawer @roubert

roozbehp commented 9 years ago

/cc @agustinfz

jungshik commented 9 years ago

@kmansourMT

kmansourMT commented 9 years ago

This call to add hyphens to various character sets seems to be a reversal of past policy.


marekjez86 commented 9 years ago

It might be a reversal of the past policy (I actually do not know). However, this is a part of a policy to allow "self-contained hyphenation" within a given font set (i.e., if a language supports hyphenation, it should be possible to hyphenate using only the script for that language), therefore we should implement it.

jungshik commented 9 years ago

You might as well call it that way. @kmansourMT

Anyway, hyphen and related characters are a subset of the characters we're considering adding to various script fonts. And the answer for hyphen and related characters is a definitive yes, as noted in this bug.

We made a candidate list (which is a superset of what we'll actually end up adding to script-specific fonts) and are soliciting feedback from you and others at MTI.

dougfelt commented 8 years ago

Looking at both the TUG data and the CLDR exemplar data, these are the scripts that according to one or the other use hyphen. Where the noto font for the script (Naskh for Arabic, LGC for Latin, Greek, Cyrillic) does not have hyphen, this is called out. The source and language of the data supporting the use of hyphen is listed under the script. In the case of TUG data, this is just the language tag used to name the file; my assumption was that the mere presence of a file in this directory is enough.

scripts using hyphen

Arab: *** noto font missing hyphen ***
  common: (2) ar, fa
  exemplars: (3) ckb_Arab, kby_Arab, ku_Arab
Armn: *** noto font missing hyphen ***
  tug: (1) hy
Beng:
  tug: (2) as, bn
Copt: *** noto font missing hyphen ***
  tug: (1) cop
Cyrl:
  tug: (5) bg, mn, ru, sr, uk
  common: (9) bg, kk, ky, mk, mn, os, ru, sr, uk
  seed: (2) ce, cu
Deva:
  tug: (3) hi, mr, sa
  common: (2) hi, mr
Ethi: *** noto font missing hyphen ***
  common: (1) am
  exemplars: (3) bcq_Ethi, drs_Ethi, kxc_Ethi
Geor: *** noto font missing hyphen ***
  tug: (1) ka
  common: (1) ka
Grek:
  tug: (2) el, grc
  common: (1) el
Gujr:
  tug: (1) gu
  common: (1) gu
Guru:
  tug: (1) pa
  common: (1) pa
Hans: *** noto font missing hyphen ***
  common: (1) zh
Hant: *** noto font missing hyphen ***
  common: (1) zh_Hant
Hebr: *** noto font missing hyphen ***
  common: (2) he, yi
Jpan: *** noto font missing hyphen ***
  common: (1) ja
Khmr:
  common: (1) km
Knda:
  tug: (1) kn
  common: (1) kn
Kore: *** noto font missing hyphen ***
  common: (1) ko
Latn:
  tug: (40) af, ca, cs, cy, da, de, en, eo, es, et, eu, fi, fr, fur, ga, gl, hr, hsb, hu, ia, id, is, it, la, lt, lv, nb, nl, nn, pl, pms, pt, rm, ro, sk, sl, sv, tk, tr, zh_Latn
  common: (45) af, ast, az, bs, ca, cs, cy, da, de, dsb, ee, en, eo, es, fi, fr, fy, gd, gl, hr, hsb, hu, id, is, it, jgo, ksh, lb, lkt, lt, lv, mt, nb, nl, pl, pt, pt_PT, ro, sk, sr_Latn, sv, to, tr, uz, vi
  seed: (4) ken, prg, vo, wa
  exemplars: (1) knf_Latn
Mlym:
  tug: (1) ml
  common: (1) ml
Orya:
  tug: (1) or
Taml:
  tug: (1) ta
  common: (1) ta
Telu:
  tug: (1) te
  common: (1) te
Thai: *** noto font missing hyphen ***
  tug: (1) th
  common: (1) th
Tibt: *** noto font missing hyphen ***
  common: (1) dz
Zzzz:
  tug: (4) kmr, la-x, mul, sh
  common: (1) root
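As a rough illustration of the TUG half of this list, here is a minimal sketch of pulling language tags out of pattern file names. It assumes the files are named like `hyph-hy.pat.txt`; the sample names and the `language_tags` helper are made up for this example and are not the script that produced the list above. Mapping each tag to a script is a separate step and is not shown.

```python
import re

# Assumed naming scheme for TUG pattern files: "hyph-<tag>.pat.txt".
PATTERN_FILE = re.compile(r"^hyph-(?P<tag>[A-Za-z0-9-]+)\.pat\.txt$")


def language_tags(filenames):
    """Return the sorted language tags found in pattern file names."""
    tags = set()
    for name in filenames:
        match = PATTERN_FILE.match(name)
        if match:
            tags.add(match.group("tag"))
    return sorted(tags)


# Illustrative directory listing, not a real one.
sample = ["hyph-hy.pat.txt", "hyph-ka.pat.txt", "hyph-hy.hyp.txt", "README"]
print(language_tags(sample))
```

Only names that end in `.pat.txt` are counted, so other per-language files in the same directory do not produce duplicate or spurious tags.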

marekjez86 commented 8 years ago

Thank you @dougfelt for creating this exhaustive list.

re:

Hans: *** noto font missing hyphen ***
  common: (1) zh
Hant: *** noto font missing hyphen ***
  common: (1) zh_Hant

I do not have an expert level knowledge of Chinese, but somehow I doubt that Chinese in either version will use hyphen for hyphenation. IMHO, all you need is to break it on a word boundary given that words there are short. However, I think we should be consistent and all fonts should include hyphen.


dougfelt commented 8 years ago

Support for that is based on CLDR, which is not always reliable, and might have a different standard for inclusion than we do. They have lots of punctuation listed for zh, including three at-signs, three asterisks, two ampersands, lots of brackets... but, for example, no ASCII digits.

Basically this is just a list of scripts to investigate further to see if they might require hyphen.

I'm reluctant to say 'just include hyphen everywhere' without having a rationale for which characters we should also 'include everywhere'. Right now the only such characters are null, line feed, and space, which are easy since they have no outlines.


dougfelt commented 8 years ago

@roozbeh, Armenian has a dedicated hyphen character at U+058A. Are you saying that it needs a standard hyphen (and hyphen-minus) as well? Would those map to the same glyph as the Armenian hyphen, or not?

dougfelt commented 8 years ago

@roozbeh, I also can't find examples of hyphen used in Ethiopic. I guess this is for other languages that use the script?

moyogo commented 8 years ago

@dougfelt in Amharic, see https://am.wikipedia.org/wiki/1_%E1%8A%A5%E1%88%BD%E1%88%98-%E1%8B%B3%E1%8C%8B%E1%8A%95 for example

xiangyexiao commented 8 years ago

What makes the hyphen characters (U+002D and U+2010) different? We rely on the LGC font (Roboto) and the Symbol font (Noto Symbol) for many common characters and symbols that are used across languages. Why can't we fall back to Roboto and Noto Symbol for the hyphen?

@roozbehp @dougfelt @jungshik

xiangyexiao commented 8 years ago

I understand that better spacing between punctuation and script-specific characters can be achieved by adding punctuation to script-specific fonts, but I'm wondering why the hyphen characters (U+002D and U+2010) are called out here as Priority-Critical. There are many other punctuation marks (e.g., brackets) that should be added to script-specific fonts to improve spacing.

In Phase III, we do plan to add common punctuation to the fonts of many scripts, which can solve this issue. However, it is not currently planned for Phase II. I'd like to understand the reason for the high priority here.

roozbehp commented 8 years ago

The reason this is marked critical is that Android requires this, or it can't hyphenate text in these languages. So, any update to Noto fonts that are used for automatically-hyphenated languages would need to include the hyphens. If this doesn't happen (either through actual addition of hyphens, or by running the code in my original comment for the "make android" target), we can't take the Noto updates.

So, the two characters U+002D and U+2010 are special because Android's automatic hyphenation code needs them.

roozbehp commented 8 years ago

Answering @dougfelt's comments:

On Ethiopic, Android automatically hyphenates all words in any language written in the Ethiopic script. There are hyphenation patterns for the script at http://tug.org/svn/texhyphen/trunk/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/

For Armenian text, we do the same (automatic hyphenation) based on patterns available from the same location as above. We have been looking into automatically inserting the Armenian hyphen, but we received feedback that the present preference is using the normal hyphen.

xiangyexiao commented 8 years ago

https://github.com/googlei18n/noto-fonts/issues/524#issuecomment-189054954

@roozbehp sorry for taking up more of your time. Could you help me further understand why the two characters U+002D and U+2010 in Roboto don't work for Android's automatic hyphenation code? I thought it doesn't matter as long as some font in the fallback chain supports them, no?

roozbehp commented 8 years ago

@xiangyexiao, unfortunately the Android hyphenation code in Minikin separates text into font runs before trying to hyphenate them, so the hyphen character should come from the same font. Changing that (so that we would separate a word into three font runs) is much more risky than running the simple script I provided for the various scripts that need the hyphen character.
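To picture the constraint described above, here is a simplified model in Python. It is not Minikin's code: the font names, coverage sets, and helper functions are all assumptions made for illustration. Text is split into font runs first, and a run can only receive a hyphen if the font chosen for that run covers the hyphen itself.

```python
HYPHEN = 0x2010


def font_for(cp, fallback_chain):
    """Return the name of the first font in the chain that covers cp."""
    for name, cmap in fallback_chain:
        if cp in cmap:
            return name
    return None


def font_runs(text, fallback_chain):
    """Split text into (font_name, substring) runs of consecutive characters."""
    runs = []
    for ch in text:
        name = font_for(ord(ch), fallback_chain)
        if runs and runs[-1][0] == name:
            runs[-1] = (name, runs[-1][1] + ch)
        else:
            runs.append((name, ch))
    return runs


def can_hyphenate(run_font, fallback_chain):
    """A run can take a hyphen only if its own font covers the hyphen."""
    cmaps = dict(fallback_chain)
    return HYPHEN in cmaps.get(run_font, set())


# Hypothetical coverage: an Armenian font without hyphen, an LGC font with it.
chain = [
    ("Armenian", set(range(0x0531, 0x0588))),
    ("LGC", set(range(0x20, 0x7F)) | {HYPHEN}),
]
word = "\u0562\u0561\u0580\u0587"  # an Armenian word
for font, run in font_runs(word, chain):
    print(font, can_hyphenate(font, chain))
```

In this model the whole word lands in one "Armenian" run, and that run cannot be hyphenated even though the LGC font later in the chain has U+2010. Adding the hyphen to the script font's coverage is what makes the check pass.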

TidharC commented 8 years ago

Any updates on this issue?

marekjez86 commented 8 years ago

@dougfelt : does noto_lint catch languages that should offer hyphenation but do not have hyphens?

dougfelt commented 8 years ago

@marekjez86 for phase 3 it would.
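A lint-style check of this kind could look roughly like the sketch below. This is not noto_lint's implementation. The script table is a partial, illustrative subset of the scripts named in this thread, and `missing_hyphens` and `format_warning` are hypothetical helpers.

```python
HYPHENS = {0x002D, 0x2010}

# Hypothetical policy table (partial): scripts whose fonts should carry hyphens.
SCRIPTS_NEEDING_HYPHEN = {"Armn", "Ethi", "Geor", "Knda", "Mlym", "Orya", "Telu"}


def missing_hyphens(script, cmap):
    """Return the required hyphen code points absent from cmap.

    cmap is any mapping (or set) keyed by code point.
    """
    if script not in SCRIPTS_NEEDING_HYPHEN:
        return set()
    return HYPHENS - set(cmap)


def format_warning(font_name, script, cmap):
    """Return a warning string, or None if nothing is missing."""
    missing = missing_hyphens(script, cmap)
    if not missing:
        return None
    listed = ", ".join("U+%04X" % cp for cp in sorted(missing))
    return "%s: missing %s" % (font_name, listed)


# Illustrative cmap with one Kannada letter and no hyphens.
print(format_warning("NotoSansKannada-Regular.ttf", "Knda", {0x0C85: "a"}))
```

With fontTools installed, the `cmap` argument could be obtained from a real font via `TTFont(path).getBestCmap()`; the sketch takes a plain mapping so that it runs without any font files.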

marekjez86 commented 7 years ago

As of 30/Nov/2017 the following scripts support hyphen:

NotoNastaliqUrdu
NotoSans
NotoSans-Italic
NotoSansArabic
NotoSansArabicUI
NotoSansArmenian
NotoSansBengali
NotoSansCham
NotoSansCoptic
NotoSansDevanagari
NotoSansDisplay
NotoSansDisplay-Italic
NotoSansEthiopic
NotoSansGeorgian
NotoSansHebrew
NotoSansKaithi
NotoSansKayahLi
NotoSansKharoshthi
NotoSansKhmer
NotoSansKhmerUI
NotoSansLinearB
NotoSansLisu
NotoSansMono
NotoSansSinhala
NotoSansSundanese
NotoSansTamil
NotoSansThai
NotoSansThaiUI
NotoSerif
NotoSerif-Italic
NotoSerifArmenian
NotoSerifDisplay
NotoSerifDisplay-Italic
NotoSerifEthiopic
NotoSerifGeorgian
NotoSerifGujarati
NotoSerifGurmukhi
NotoSerifHebrew
NotoSerifKhmer
NotoSerifSinhala
NotoSerifTamil
NotoSerifThai

marekjez86 commented 7 years ago

@waksmonskiMT , @JelleBosmaMT , @kmansourMT :

We still need to support hyphen in (at least that's what Doug thinks based on the tags): Kannada, Malayalam, Oriya, Telugu, Tibetan

eroux commented 6 years ago

I don't really see why Tibetan is marked as needing a hyphen... it looks like a mistake. If not, I'd be curious to know what is meant by hyphenation in Tibetan?

dougfelt commented 6 years ago

As I said much earlier, this was based on CLDR and TUG data, not on any independent investigation on my part. If no one can find examples of languages written in these scripts that require hyphen, and Android doesn't require it, then these fonts don't need it. Monotype should flag this as not required and we can change lint and the cmap list to reflect that.

eroux commented 6 years ago

Well, I can open a ticket on the CLDR tracker to ask the question. I can see above:

Tibt: *** noto font missing hyphen ***
common: (1) dz

I guess it means that CLDR indicates that dz requires hyphen, but I'm not sure where the information comes from in CLDR... can you give me a hint?

dougfelt commented 6 years ago

I'm not quite sure where the web UI surfaces this data. There are (were) several sets of exemplarCharacters in dz.xml (I'm looking at revision 13686 which might be a bit old), and the 'numbers' and 'punctuation' sets both include hyphen and some other Latin punctuation.

CLDR's exemplar data often include characters that are seen in texts that, while predominantly of the target script, occasionally include some Latin elements, such as 'international' dates/times/prices, or part codes etc. in manuals/technical materials. These data might reflect that.

eroux commented 6 years ago

I see! Rev. 13869 has \u2010 in the <exemplarCharacters type="punctuation"> element too. I have to admit it's quite mysterious... I don't know if this is worth a separate issue, but there are a few Chinese punctuation characters that are used a lot in modern books edited in China in Tibetan script and do not appear in CLDR's bo.xml, for instance 《, 》, 〈, 〉. I'll open issues on CLDR, although my experience is that these are generally ignored (at least the ones related to Tibetan)...
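For anyone who wants to repeat this check on a locale file, here is a minimal sketch using Python's standard XML parser. The LDML fragment is a made-up, heavily trimmed sample, not the contents of dz.xml, and `mentions_hyphen` is a naive substring test rather than a real UnicodeSet parser (it would miss a hyphen that is only covered by a range).

```python
import xml.etree.ElementTree as ET

# Hypothetical LDML fragment for illustration only.
SAMPLE_LDML = """<ldml>
  <characters>
    <exemplarCharacters>[\u0F40 \u0F41 \u0F42]</exemplarCharacters>
    <exemplarCharacters type="punctuation">[\\- \\u2010 , ;]</exemplarCharacters>
  </characters>
</ldml>"""


def exemplar_sets(ldml_text):
    """Map each exemplarCharacters type ('main' if absent) to its raw text."""
    root = ET.fromstring(ldml_text)
    return {
        node.get("type", "main"): (node.text or "")
        for node in root.iter("exemplarCharacters")
    }


def mentions_hyphen(raw_set):
    """Naive test: the literal U+2010 character or its escaped form appears."""
    return "\u2010" in raw_set or "\\u2010" in raw_set.lower()


sets = exemplar_sets(SAMPLE_LDML)
for kind, raw in sorted(sets.items()):
    print(kind, mentions_hyphen(raw))
```

The same two functions can be pointed at the text of a real locale file read from disk; only the sample string above is invented.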

marekjez86 commented 5 years ago

fixed for Telugu in the new delivery

marekjez86 commented 5 years ago

fixed for Serif Devanagari in https://github.com/googlefonts/noto-fonts/tree/master/phaseIII_only/unhinted/ttf/NotoSerifDevanagari (Added U+2010)

marekjez86 commented 5 years ago

U+002D, U+00AD and U+2010 are present in Telugu

marekjez86 commented 5 years ago

fixed a while ago in https://github.com/googlefonts/noto-fonts/tree/master/phaseIII_only/unhinted/otf/NotoSansArmenian and https://github.com/googlefonts/noto-fonts/tree/master/phaseIII_only/unhinted/otf/NotoSerifArmenian

marekjez86 commented 4 years ago

fixed in https://github.com/googlefonts/get-noto/tree/master/unhinted/Gujarati
