n8willis / opentype-shaping-documents

Documentation of OpenType shaping behavior

[Khmer] Establish a syllable regex strategy #129

Open n8willis opened 3 years ago

n8willis commented 3 years ago

This issue asks the question "how should these docs define syllables for Khmer", which is a non-trivial question because there are several non-identical definitions from the upstream sources that are common reference points for some of the other scripts.

Basically, there is what's written in the Unicode Standard, there's Microsoft's Script-shaping docs, there's the Open Forum of Cambodia, and there are also the practical examples of SIL's Mondulkiri font family and HarfBuzz. There may also be others that I don't have (or haven't found) access to (e.g., Apple, various Adobe implementations, type foundries, etc).

When the original version of the Khmer doc in this repo was added, it just took the "follow what HarfBuzz does" approach. That only captured a snapshot, though, since the HarfBuzz developers have continued to improve their implementation.

It seems to be the consensus view that the existing/prior sources' regular expressions don't contain a single, perfect model that the others should all adopt. The W3C font-and-text Community Group has been having regular working calls to tackle issues for Khmer (of which syllable formulation is just one).

That group discussion seems like a good place to pin down "what we want this repo to document", although it is not yet complete, so anybody tackling a Khmer implementation today wouldn't have an easy time of it.

In the meantime, the question remains whether these docs ought to stick to the "follow HarfBuzz" approach or do something else. Considering that Allsorts is using these docs as a reference, it would seem pretty valuable to be able to say that the various FOSS shaping engines are in step, but this is an open forum, so if anyone has a better bird's-eye-view approach to suggest, please do so.

n8willis commented 3 years ago

Here are the existing in-the-wild Khmer syllable regular expressions that I've found:

Unicode (as of v13)

B {R | C} {S {R}}* {{Z} V} {O} {S}

where:
B = base character (consonant character, independent vowel character, and so on)
R = robat
C = consonant shifter
S = subscript consonant or independent vowel sign
V = dependent vowel sign
Z = zero width non-joiner or a zero width joiner
O = any other sign

Microsoft

Cons + {COENG + (Cons | IndV)} + [PreV | BlwV] + [RegShift] + [AbvV] + {AbvS} + [PstV] + [PstS]

where:
Cons = Consonant
IndV = Independent Vowel
COENG = "Sign Coeng" code
PreV = pre-base dependent vowel, with "prebase+postbase" dependent vowels classified as PstV
BlwV = below-base dependent vowel
RegShift = register shifter (Triisap or Muusikatoan)
AbvV = above-base dependent vowel, with "prebase+abovebase" dependent vowels classified as AbvV
AbvS = above-base sign or mark
Robat = Robat glyph
PstV = post-base dependent vowel
PstS = post-base sign

Open Forum of Cambodia

Consonant + Robat {+ Vowel} {+ Sign}
OR
Consonant + Coeng_consonant(s) + Consonant_shifter + Vowel + Above_signs + After_signs

where:
Consonant = [U+1780..U+17A2] or [U+17A5..U+17B3]
Coeng consonant = [U+17D2] + {[U+1780..U+17A2] or [U+17A5..U+17B3]}
Vowel = [U+17B6..U+17C5]
Above_sign = [U+17C6, U+17CB, U+17CD..U+17D1, U+17D3, U+17DD]
After_sign = [U+17C7, U+17C8]
Sign = Above_sign OR After_sign
Consonant_shifter = [U+17C9, U+17CA]
Robat = [U+17CC]
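
As a rough illustration, the Open Forum code point classes listed above can be written as a lookup table. This is only a sketch: the table layout, the `classify` and `classify_text` helpers, and the separate "Coeng" label for U+17D2 are my own naming, not anything the Open Forum defines.

```python
# Code point classes as listed in the Open Forum of Cambodia definition above.
# The table layout and helper names are illustrative assumptions.
OPEN_FORUM_CLASSES = [
    ("Consonant", [(0x1780, 0x17A2), (0x17A5, 0x17B3)]),
    ("Vowel", [(0x17B6, 0x17C5)]),
    ("Above_sign", [(0x17C6, 0x17C6), (0x17CB, 0x17CB), (0x17CD, 0x17D1),
                    (0x17D3, 0x17D3), (0x17DD, 0x17DD)]),
    ("After_sign", [(0x17C7, 0x17C8)]),
    ("Consonant_shifter", [(0x17C9, 0x17CA)]),
    ("Robat", [(0x17CC, 0x17CC)]),
    # U+17D2 on its own; "Coeng" is a label chosen here for the sign itself.
    ("Coeng", [(0x17D2, 0x17D2)]),
]


def classify(ch):
    """Return the Open Forum class name for one character, or None."""
    cp = ord(ch)
    for name, ranges in OPEN_FORUM_CLASSES:
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return None


def classify_text(text):
    """Classify each character of a string independently."""
    return [classify(ch) for ch in text]


if __name__ == "__main__":
    # KA, COENG, MO, vowel sign AE
    print(classify_text("\u1780\u17D2\u1798\u17C2"))
```

Code points that fall outside every listed range (U+17A3 and U+17A4, for example) come back as `None`, which makes the gaps in this particular definition easy to spot.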

SIL Mondulkiri

B {{{Z1} S} or R} C {{Z1} S} C {{Z1} S} {{Z2} V} NS C SS

where:
B  = consonant | independent_vowel
Z1 = ZWJ | ZWNJ
S  = register_shifter
R  = Robat
C  = _sign_coeng_ B
Z2 = ZWJ | ZWNJ
V  = dependent_vowel
NS = non-spacing symbol
SS = spacing symbol

n8willis commented 3 years ago

Quick attempt to rewrite the above expressions with the same symbol set / syntax:

B = base = _consonant_ | _independent_vowel_
R = "Robat"
S = _register_shifter_
K = "Sign Coeng"
M = _dependent_vowel_
Z = "ZWJ" | "ZWNJ"
T = nonspacing_mark/symbol
Y = spacing_mark / syllable_modifier_symbol

Unicode

B (R | S)? ((K B) R?)* (Z? M)? (T|Y)? (K B)?

Microsoft

B (K B){0,2} M? S? M? (T){0,2} M? Y?

Open Forum

B R? M? (T|Y)?
OR
B (K B)* S? M? T? Y? 

SIL Mondulkiri

B ((Z? S) | R)? (K B)? (Z? S)? (K B)? (Z? S)? (Z? M)? T? (K B)? Y?
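
Since the rewritten expressions above are already close to ordinary regex syntax, one way to compare them is to compile each one and run it over strings of the class symbols (B, R, S, K, M, Z, T, Y). The following is a minimal sketch under that assumption. It works on symbol strings only, so mapping real Khmer code points to symbols is left out, and the `EXPRESSIONS` table and `accepted_by` helper are hypothetical names. The two Open Forum alternatives are joined with `|`.

```python
import re

# The four rewritten expressions, copied from above. re.VERBOSE lets the
# spaces between symbols stay in place, since whitespace is ignored.
EXPRESSIONS = {
    "Unicode": r"B (R | S)? ((K B) R?)* (Z? M)? (T|Y)? (K B)?",
    "Microsoft": r"B (K B){0,2} M? S? M? (T){0,2} M? Y?",
    # The two "OR" alternatives, joined into one pattern.
    "Open Forum": r"(?: B R? M? (T|Y)? ) | (?: B (K B)* S? M? T? Y? )",
    "SIL Mondulkiri": (
        r"B ((Z? S) | R)? (K B)? (Z? S)? (K B)? (Z? S)? (Z? M)? T? (K B)? Y?"
    ),
}

COMPILED = {
    name: re.compile(expr, re.VERBOSE) for name, expr in EXPRESSIONS.items()
}


def accepted_by(symbols):
    """Report which expressions match the entire symbol string."""
    return {name: bool(rx.fullmatch(symbols)) for name, rx in COMPILED.items()}


if __name__ == "__main__":
    for sample in ("BKBM", "BRM", "BKBKBKB"):
        print(sample, accepted_by(sample))
```

For example, `accepted_by("BRM")` reports a mismatch for the Microsoft rewrite, because R does not appear anywhere in that expression, while the other three accept it.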

NorbertLindenberg commented 3 years ago

See also Issues in Khmer syllable validation

n8willis commented 3 years ago

At the risk of overcomplicating matters, I'm taking a stab at updating the syllable-IDing expressions in the docs to correctly match what HarfBuzz does. That branch is available here: https://github.com/n8willis/opentype-shaping-documents/blob/khmer-syllables-2/opentype-shaping-khmer.md#1-identifying-syllables-and-other-sequences

Note that HarfBuzz's current Ragel machine uses general terms ("Xgroup" and "Ygroup") for the marks/diacritics classes that require separate treatment. The two groups are almost, but not exactly, the same as the W3C-cg's "non-spacing diacritics" and "right side, spacing diacritics" classes, so I elected to name them that way in the WIP branch.

The main differences are (1) that HarfBuzz puts U+17DD in with the "right side, spacing diacritics" instead of the "non-spacing diacritics" and (2) that HarfBuzz puts U+17D3 into the "right side, spacing diacritics" class whereas the W3C-cg documents don't include it anywhere.

In any case, those are still just classes, not the expressions to match syllables & subsyllables. The actual syllable-ID expressions under development by the W3C-cg folks take an entirely different approach than HarfBuzz's expressions, but I thought that at least aligning the classes a bit closer might reduce confusion. We shall see....

n8willis commented 2 years ago

Informational update only:

It looks as though the ad-hoc interest group has settled on a set of regular expressions that works well for contemporary Khmer, and has at least developed an extension that also handles Middle Khmer. However, the Middle Khmer exception set is rather large, though some shuffling and additional work could simplify it.

Eventually, the effort is likely to move to a UTC issue, as well as to several Khmer-language ministries and institutions. The concerns of some of those groups, though, are essentially "input" (e.g., how to design software keyboards to let people type things in the correct order and avoid incorrect orders that used to result in virtually-identical renderings) and "data-cleaning" (e.g., how to identify confusable strings in old documents and fix the incorrect ones to their correct orderings).