rosettatype / hyperglot

Hyperglot: a database and tools for detecting language support in fonts
http://hyperglot.rosettatype.com
GNU General Public License v3.0

Shaping checks for Brahmi-derived scripts #176

Open MrBrezina opened 2 months ago

MrBrezina commented 2 months ago

We welcome comments on the following proposal. Regarding implementation, please use #175.

Objective

We want to test a font's ability to correctly render text in the corresponding Brahmi-derived scripts. This group includes most Indian/Indic scripts:

and some of the Southeast Asian (SEA) scripts, for example:

We focus on the Unicode Standard and OpenType technologies and their recommended practices.

For convenience, US refers to version 15.0 of the Unicode Standard, available here: https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf

Context

See the US chapter on Devanagari for a full introduction to the shaping of

Brahmi-derived scripts form more complex syllables (conjuncts) from sequences of two or more atomic consonantal syllables (also called consonants for short). Each consonant contains an inherent vowel “a” which can be killed using a symbol called virama/halant, thus producing a dead consonant. This typically results in chaining of consonants to form conjuncts:

The sequence of the consonant Ka, virama, and consonant Pa will result in the conjunct KPa:

Ka + virama + Pa → KPa

Note: what precedes → is a sequence of code points; what follows → is the visual representation, i.e. the rendering.

“Normally a virama character serves to create dead consonants that are, in turn, combined with subsequent consonants to form conjuncts. This behavior usually results in a virama sign not being depicted visually. Occasionally, this default behavior is not desired when a dead consonant should be excluded from conjunct formation, in which case the virama sign is visibly rendered. To accomplish this goal, the Unicode Standard adopts the convention of placing the character U+200C zero width non-joiner immediately after the encoded dead consonant that is to be excluded from conjunct formation. In this case, the virama sign is always depicted as appropriate for the consonant to which it is attached.” (p. 470 of the US)

Without ZWNJ, a conjunct should be rendered (if available):

Ka + virama + Ka → KKa

Zero-width non-joiner (ZWNJ, U+200C) ensures virama in the dead consonant is rendered visually:

Ka + virama + ZWNJ + Ka → Ka + virama + Ka

“When a dead consonant participates in forming a conjunct, the dead consonant form is often absorbed into the conjunct form, such that it is no longer distinctly visible. In other contexts, the dead consonant may remain visible as a half-consonant form. In general, a half-consonant form is distinguished from the nominal consonant form by the loss of its inherent vowel stem, a vertical stem appearing to the right side of the consonant form. In other cases, the vertical stem remains but some part of its right-side geometry is missing. In certain cases, it is desirable to prevent a dead consonant from assuming full conjunct formation yet still not appear with an explicit virama. In these cases, the half-form of the consonant is used. To explicitly encode a half-consonant form, the Unicode Standard adopts the convention of placing the character U+200D zero width joiner immediately after the encoded dead consonant.” (p. 470–1 of the US)

Without ZWJ, a conjunct should be rendered (if available):

Ka + virama + Ka → KKa

ZWJ ensures a half form is rendered instead:

Ka + virama + ZWJ + Ka → K- + Ka

Note that half forms may also exist for conjuncts. This depends on the organisation and design of the font.
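To make the three encodings above concrete, here is a minimal Python sketch that spells them out as Devanagari code point sequences. The `SEQUENCES` dictionary and the `describe` helper are illustrative names only and are not part of Hyperglot.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# The three inputs discussed above, keyed by the rendering they request.
SEQUENCES = {
    "conjunct": KA + VIRAMA + KA,
    "explicit virama": KA + VIRAMA + ZWNJ + KA,
    "half form": KA + VIRAMA + ZWJ + KA,
}


def describe(text):
    """List each code point of text as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, text in SEQUENCES.items():
    print(label, describe(text))
```

The strings only differ in the joiner that follows the virama; which glyphs come out depends on the shaping engine and the font.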

Possible checks

There are multiple shaping tests that could be conducted. They could be derived from the “R notes” (effectively rules of the script grammar(s)) in the US chapters on Indic scripts. Some of these rules are handled by a shaping engine (e.g. order of characters in the text). Others are handled by a font. Specifically, the following three (exemplified on Devanagari):

1. Conjunct formation

Syllables that include a virama should ideally produce a conjunct, represented either as a single precomposed glyph or as a sequence of partial glyphs (e.g. half forms). For the most frequent conjuncts, the virama should not be rendered visually in the output. However, there may be rare syllabic sequences that have no established, visually distinct conjunct; these should use either half forms or an explicit virama. Also, sequences with ZWNJ after the virama should produce a visible virama.

Input: syllabic sequence, e.g. consonant + virama + consonant
Effect: no virama glyph in the output
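A minimal sketch of the "no virama glyph in the output" test, assuming the shaped glyph names have already been obtained from a shaping engine. The function name and the glyph names (`viramadeva`, `k_padeva`, and so on) are hypothetical; real fonts name glyphs differently, so an actual check would more likely compare glyph IDs.

```python
def conjunct_formed(shaped_glyphs, virama_glyphs=frozenset({"viramadeva"})):
    """Return True if no explicit virama glyph appears in the shaped output.

    shaped_glyphs: glyph names produced by a shaping engine (supplied by the caller).
    virama_glyphs: names of the font's virama glyph(s); hypothetical default.
    """
    return not any(name in virama_glyphs for name in shaped_glyphs)


# Hypothetical shaper outputs for Ka + virama + Pa:
print(conjunct_formed(["k_padeva"]))                        # single precomposed glyph
print(conjunct_formed(["kadeva.half", "padeva"]))           # half form + full consonant
print(conjunct_formed(["kadeva", "viramadeva", "padeva"]))  # fallback with visible virama
```

The same predicate would also cover the half-form test, since its stated effect is identical.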

Note on collecting syllables and conjuncts: A set of plausible syllables can be collected from online sources, post-processed, and filtered to keep only well-formed syllables, which avoids frequent typos and strings encoded in ways that do not follow the US. Syllables that include a virama should ideally produce a conjunct. Frequency should be noted for each syllable. An arbitrary threshold (e.g. 0.5–1%) can be used to approve or reject conjunct support in a font. We prefer this method as it is scalable and independent of particular opinions. However, the set can be extended manually in case the online corpora are insufficient.
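The frequency filtering described in this note could look roughly like the sketch below. It assumes the corpus has already been segmented into well-formed syllables, and that frequency is measured relative to all syllables in the corpus; both are assumptions, as is the function name.

```python
from collections import Counter

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA


def conjunct_candidates(syllables, threshold=0.005):
    """Return {syllable: relative frequency} for virama syllables at or above threshold.

    syllables: iterable of already segmented, well-formed syllable strings.
    threshold: minimum share of all syllables (0.005 corresponds to 0.5%).
    """
    counts = Counter(syllables)
    total = sum(counts.values())
    return {
        syllable: count / total
        for syllable, count in counts.most_common()
        if VIRAMA in syllable and count / total >= threshold
    }


# Toy corpus: KPa occurs 3 times, KKa once, and a plain syllable 96 times.
corpus = ["\u0915\u094D\u092A"] * 3 + ["\u0915\u094D\u0915"] + ["\u0915\u093E"] * 96
print(conjunct_candidates(corpus, threshold=0.02))
```

With a threshold of 2%, only the syllable that occurs 3 times out of 100 is kept; the rarer conjunct and the syllable without a virama are dropped.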

2. Half-form formation

A set of simple half forms can be defined for each script and their shaping tested this way:

Input: consonant + virama + ZWJ
Effect: no virama glyph in the output

Note on collecting half form strings for testing: examples of half forms made of atomic consonants are mentioned in the US. Not all atomic consonants produce half forms. Thus, these need to be defined manually.

3. Ligature formation

Note that ligatures here refer to predefined visual representations of combinations of consonants with vowels. A set of basic ligatures is defined in the US and can be tested.

Input: sequence of code points
Effect: GSUB substitution happened
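One possible way to decide that a substitution happened is to compare the glyphs a font maps the code points to by default with the glyphs that come out of shaping. The sketch below assumes both lists are supplied by the caller; the function name and glyph names are hypothetical.

```python
from collections import Counter


def substitution_happened(nominal_glyphs, shaped_glyphs):
    """Return True if shaping changed which glyphs are used, not just their order.

    nominal_glyphs: glyphs the code points map to without any shaping.
    shaped_glyphs: glyphs after shaping (both supplied by the caller).
    """
    return Counter(nominal_glyphs) != Counter(shaped_glyphs)


# Hypothetical glyph names for Ra + vowel sign U forming a ligature:
print(substitution_happened(["radeva", "uMatradeva"], ["r_uMatradeva"]))
# Reordering alone (e.g. a pre-base vowel sign moved by the engine) is not counted:
print(substitution_happened(["kadeva", "iMatradeva"], ["iMatradeva", "kadeva"]))
```

The comparison ignores order on purpose, because shaping engines reorder some glyphs themselves and that should not be mistaken for a font-side substitution.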

kontur commented 2 months ago

First implementations of these in this branch. Note that there is only tentative data for hin.yaml so far. The conjunct shaping check is the first check to support a threshold, so e.g. by default 95% (or some such) of listed conjuncts need to form without fault.

kontur commented 1 month ago

Getting there. The implementation now has hin.yaml with combinations that each have a normalized frequency. The CLI -t (threshold) argument can set up to what frequency conjuncts need to be supported by a font to pass.
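As an illustration only, one reading of the threshold is that every combination whose normalized frequency is at or above it must shape correctly. The sketch below follows that reading; it is not the code in the branch, and the function name, data layout, and frequency values are made up for the example.

```python
def passes_threshold(frequencies, supported, threshold):
    """Check that all combinations at or above threshold are supported.

    frequencies: {combination: normalized frequency between 0 and 1}.
    supported: combinations the font shaped without fault.
    Returns (passed, missing), where missing lists the required
    combinations the font failed on.
    """
    required = {combo for combo, freq in frequencies.items() if freq >= threshold}
    missing = sorted(required - set(supported))
    return not missing, missing


# Made-up frequencies for three Devanagari conjuncts.
frequencies = {
    "\u0915\u094D\u0937": 0.40,  # KSsa
    "\u0924\u094D\u0930": 0.30,  # TRa
    "\u0915\u094D\u092A": 0.01,  # KPa
}
print(passes_threshold(frequencies, {"\u0915\u094D\u0937", "\u0924\u094D\u0930"}, 0.05))
```

Under this reading, a rare combination below the threshold does not cause a failure even if the font lacks it.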

The half forms inside the combinations need some review, because some non-shaping half forms parsed from the language corpus data do not actually consume the virama. (They are commented out for now.)

Furthermore, I've added a check to confirm that marks in combinations get attached. This proved trickier than the mark attachment check for e.g. Latin, because many marks are designed without anchors and align to the preceding stem simply by a negative left sidebearing, so it is impossible to conclusively test that these attach in a reasonable way. I've included this check, but it's non-failing and simply outputs a warning like:

Mark positioning for cluster 'वृ' failed. Unpositioned marks: u0943
Mark positioning for cluster 'पृ' failed. Unpositioned marks: u0943
Mark positioning for cluster 'मृ' failed. Unpositioned marks: u0943
Mark positioning for cluster 'तृ' failed. Unpositioned marks: u0943
Mark positioning for cluster 'उं' failed. Unpositioned marks: u0902
Mark positioning for cluster 'गृ' failed. Unpositioned marks: u0943
Mark positioning for cluster 'श्रृं' failed. Unpositioned marks: u0943
Mark positioning for cluster 'स्तृ' failed. Unpositioned marks: u0943
Mark positioning for cluster 'सृ' failed. Unpositioned marks: u0943
Mark positioning for cluster 'बृ' failed. Unpositioned marks: u0943
Mark positioning for cluster 'नृ' failed. Unpositioned marks: u0943
...

I haven't a clue on how to tackle this, since this seems pretty common practice for Devanagari fonts. We could check the negative LSB is at least the width of the mark itself on any such non-attached mark, but really that guarantees nothing.
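For reference, the LSB heuristic mentioned above could be sketched as follows, assuming the mark's horizontal bounds are read from the font and its left sidebearing equals `x_min`. The function name and the sample values are hypothetical, and as said, passing this test does not prove the mark lands in the right place.

```python
def mark_overhangs_origin(x_min, x_max):
    """Return True if the negative LSB is at least as large as the mark's width.

    x_min, x_max: horizontal bounds of the mark outline in font units.
    With the LSB equal to x_min, the condition -x_min >= x_max - x_min
    means the whole outline sits at or left of the glyph origin.
    """
    lsb = x_min
    width = x_max - x_min
    return lsb < 0 and -lsb >= width


print(mark_overhangs_origin(-300, -20))  # outline entirely left of the origin
print(mark_overhangs_origin(-100, 150))  # outline straddles the origin
```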