rosettatype / hyperglot

Hyperglot: a database and tools for detecting language support in fonts
http://hyperglot.rosettatype.com
GNU General Public License v3.0

The design_requirements for Dutch [nld] are misleading #118

Open moyogo opened 1 year ago

moyogo commented 1 year ago

The design_requirements for Dutch [nld] are misleading: https://github.com/rosettatype/hyperglot/blob/d007599070c768d1b322708e18d70c79e75e249d/lib/hyperglot/hyperglot.yaml#L8189-L8190

The design requirements should say: "When the acute on the j of a stressed lange ij is not omitted, the <j> should lose its dot when combined with a combining acute. The stressed lange ij is usually spelled íj, but íj́ when possible. Generally, fonts should not add an acute that is not there in the text."

1. The characters used are confusing

The current design requirements say that <ĳ> U+0133 LATIN SMALL LIGATURE IJ is combined with U+0301 COMBINING ACUTE and that <ȷ> U+0237 LATIN SMALL LETTER DOTLESS J should have an acute when following <í> U+00ED LATIN SMALL LETTER I WITH ACUTE.

<ij> U+0069 U+006A should be used instead of <ĳ> U+0133 LATIN SMALL LIGATURE IJ, as the letter combination i+j U+0069 U+006A is generally used for the lange ij.

See for example Taalunie, Technische Handleiding: Regels voor de officiële spelling van het Nederlands, 2016, p. 19:

De lettercombinatie i+j (lange ij) gedraagt zich soms alsof het om één enkele letter gaat.

See also Unicode, chapter 7:

Another pair of characters, U+0133 latin small ligature ij and its uppercase version, was provided to support the digraph “ij” in Dutch, often termed a “ligature” in discussions of Dutch orthography. When adding intercharacter spacing for line justification, the “ij” is kept as a unit, and the space between the i and j does not increase. In titlecasing, both the i and the j are uppercased, as in the word “IJsselmeer.” Using a single code point might simplify software support for such features; however, because a vast amount of Dutch data is encoded without this digraph character, under most circumstances one will encounter an <i, j> sequence.

The ȷ U+0237 LATIN SMALL LETTER DOTLESS J is not used in Dutch, j U+006A LATIN SMALL LETTER J is.
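To make the distinction between these code points concrete, here is a minimal sketch using only Python's standard `unicodedata` module. The constant names are mine and purely illustrative. It prints the name of each code point discussed above and checks that the ligature U+0133 carries a compatibility decomposition to i + j, while the dotless j does not decompose at all.

```python
import unicodedata

# Illustrative constants (names are not from hyperglot)
LIGATURE_IJ = "\u0133"       # single code point, LATIN SMALL LIGATURE IJ
LETTERS_IJ = "\u0069\u006A"  # two code points, i followed by j
DOTLESS_J = "\u0237"         # LATIN SMALL LETTER DOTLESS J

# Print code point and Unicode name for every character involved
for ch in LIGATURE_IJ + LETTERS_IJ + DOTLESS_J:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Compatibility normalization (NFKC) folds the ligature into i + j
folded_ligature = unicodedata.normalize("NFKC", LIGATURE_IJ)

# The dotless j has no decomposition, so NFKC leaves it unchanged
folded_dotless_j = unicodedata.normalize("NFKC", DOTLESS_J)
```

This is also a reason to be careful when copying examples around: any tool that applies NFKC will silently turn U+0133 into the two-letter sequence.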

2. Substitution of j after í breaks Dutch text

An automated substitution that replaces j by ȷ with acute after í breaks Dutch text. The Taalunie stress marks spelling rule 5.1 says:

Het klemtoonteken is ´. Als een klinker of tweeklank met twee of meer letters geschreven wordt, krijgen de eerste twee een klemtoonteken. [...] Door technische beperkingen vervalt meestal het nadrukteken op de j van een lange ij. Bijvoorbeeld: blíjven kijken!

Which can be translated as:

The stress mark is ´. If a vowel or diphthong is written with two or more letters, the first two letters get a stress mark. Due to technical limitations, the stress mark on the j is usually omitted from a lange ij. For example: blíjven kijken!

So Dutch text can follow the official spelling rules and omit the acute on the j of stressed lange ij, like in the example provided.

Additionally, this spelling rule was standardized in the 1996 spelling; before that, it was common to put the acute only on the first letter of digraphs composed of two different letters. See for example Jan Renkema, Schrijfwijzer, 1987, p. 159. Many Dutch speakers still write, and many Dutch texts are written, with pre-1996 rules. They use níet instead of níét, góed instead of góéd, zíjn instead of zíj́n. A font should not make any of these look like they have an additional acute.

There is also the issue of foreign names in Dutch text, like Níjar or Szíj, which would be displayed incorrectly.
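As a rough illustration of the problem, here is a text-level analogue of such a substitution, sketched in Python. The function `naive_stress_ij` is hypothetical and is not how a font does it (a font would substitute glyphs, not code points), but the effect on the reader is the same: every j after í gains an acute that the author never typed.

```python
import re

COMBINING_ACUTE = "\u0301"
I_ACUTE = "\u00ED"  # í, LATIN SMALL LETTER I WITH ACUTE

def naive_stress_ij(text: str) -> str:
    """Hypothetical rule: add a combining acute to every j that follows í."""
    return re.sub(I_ACUTE + "j", I_ACUTE + "j" + COMBINING_ACUTE, text)

# Words from this thread: official spelling, pre-1996 spelling, foreign names
samples = ["bl\u00EDjven", "z\u00EDjn", "N\u00EDjar", "Sz\u00EDj", "kijken"]

for word in samples:
    changed = naive_stress_ij(word)
    status = "altered" if changed != word else "unchanged"
    print(f"{word} -> {changed} ({status})")
```

Only the unstressed kijken comes through untouched; the correctly spelled blíjven and the names Níjar and Szíj are all altered.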

kontur commented 1 year ago

Very interesting, thank you @moyogo! On 1) I agree... the use of the jdotless in the example likely stems from a designer-centric view where that letter would be the component used to construct the ij with acute. As for 2), this is new to me. I was under the impression that the "technical limitation" should be circumvented whenever possible. So overall, should this be an optional recommendation that also mentions the different styles/orthographies?

MrBrezina commented 1 year ago

Sorry for taking so long to get to this. I have a draft which I will push in a moment for your review. It is longer than what you proposed. Hopefully, it helps clarity. What I am still unsure about is this bit where I say:

It is up to the font developers to decide whether they want to treat lange ij as a single unit during tracking or not.

We had some Dutch readers telling us they would prefer for <i><j> to get tracked and others would insist on keeping it a single unit. This:

De lettercombinatie i+j (lange ij) gedraagt zich soms alsof het om één enkele letter gaat.

says that it can “sometimes” behave like a single unit, hence my recommendation above.

I can see three strategies font developers can take:

  • merge <i><j> to <ĳ> (on a glyph level)
  • dissolve <ĳ> to <i><j> and <Ĳ> to <I><J>
  • leave <ĳ> and <Ĳ> intact and leave that control to users, i.e. they can use <ĳ> to keep them as a single unit or <i><j> to track

Each of these then requires a different solution when adding stress on the lange ij. The latter two strategies would work well for multilingual texts.

moyogo commented 1 year ago

We had some Dutch readers telling us they would prefer for <i><j> to get tracked and others would insist on keeping it a single unit. This:

De lettercombinatie i+j (lange ij) gedraagt zich soms alsof het om één enkele letter gaat.

says that it can “sometimes” behave like a single unit, hence my recommendation above.

I should have quoted the whole paragraph from Taalunie, Technische Handleiding, 2016 (it’s actually online: https://taalunie.org/feeds/download/technische-handleiding-2016-5dcab.pdf/Technische%20Handleiding/original):

De lettercombinatie i+j (lange ij) gedraagt zich soms alsof het om één enkele letter gaat. Zo worden i en j aan het begin van een zin of een eigennaam beide als hoofdletter geschreven: IJmuiden, IJzermonding. Ze staan in kruiswoordraadsels vaak samen in één vakje. In naslagwerken of telefoongidsen worden de woorden of namen die ij bevatten, soms onder de letter y gealfabetiseerd. In de meeste woordenboeken is er echter geen sprake van een aparte letter ij, maar wordt ij geplaatst tussen -ii- en -ik-.

The "sometimes" means it behaves like a single letter in some contexts (beginning of sentences and of proper nouns, or sorted like y in some reference works) and like two letters in others (sorted like i+j in most dictionaries). I don’t think the Taalunie was referring to users' preference for tracking, but tracking it as a unit has definitely been the norm historically.
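The capitalization case is easy to see in code. The sketch below compares Python's built-in `str.capitalize` with a small helper, `capitalize_dutch`, which is a hypothetical name of mine and assumes that every word-initial i+j is a lange ij. With the two-letter sequence, generic casing only uppercases the i; with U+0133, the single code point maps to U+0132 and the issue does not arise.

```python
def capitalize_dutch(word: str) -> str:
    """Capitalize a word, uppercasing both letters of an initial i+j.

    Simplifying assumption: any word-initial i+j is a lange ij.
    """
    if word[:2].lower() == "ij":
        return "IJ" + word[2:]
    return word[:1].upper() + word[1:]

# Generic capitalization treats i and j as unrelated letters
generic = "ijmuiden".capitalize()

# The Dutch-aware helper uppercases both letters
dutch = capitalize_dutch("ijmuiden")

# With the single code point U+0133 the built-in mapping yields U+0132
ligature = "\u0133muiden".capitalize()

print(generic, dutch, ligature)
```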

I can see three strategies font developers can take:

  • merge <i><j> to <ĳ> (on a glyph level)
  • dissolve <ĳ> to <i><j> and <Ĳ> to <I><J>
  • leave <ĳ> and <Ĳ> intact and leave that control to users, i.e. they can use <ĳ> to keep them as a single unit or <i><j> to track

Each of these then requires a different solution when adding stress on the lange ij. The latter two strategies would work well for multilingual texts.

Dissolving <ĳ> and <Ĳ> defeats their purpose, at least according to the Unicode paragraph quoted before. Users who want ij kept as a unit can use <ĳ> and <Ĳ> safely (unless a font abuses Unicode and dissolves them).

Fonts may provide the same tracking behaviour for <i><j> and <I><J> as a unit, it may be optional or by default, but either way it should be easy to enable or disable. The Taalunie is pretty clear on the "lange ij" being i+j. Having a ligature letterform for <i><j> is great for some or in some display styles and in handwritten styles, but it’s not what most people are used to seeing in text styles.

Drawing with broad brushes, some Dutch speakers feel strongly that lange ij is a single letter with encoding issues and some other Dutch speakers feel more that it’s a letter combination with a special casing rule. Generalizing a bit, there is a Netherlands and Belgium divide on the issue.

MrBrezina commented 1 year ago

Thank you @moyogo, got it. Updated the design requirements one more time. Please, let me know if something does not sound right. I have read it too many times.

moyogo commented 1 year ago

It looks good to me. Thank you @MrBrezina.