n8willis / opentype-shaping-documents

Documentation of OpenType shaping behavior

[Thai/Lao] Mark reordering and tone markers #125

Open adrianwong opened 3 years ago

adrianwong commented 3 years ago

Our spec states the following (emphasis mine):

  • A "Nikhahit" or "Niggahita" mark that originated as part of an "Am" sign (which was decomposed in stage two, above) must be reordered so that it occurs before any tone markers in the sequence of marks.

According to our character tables, Thai has four TONE_MARKER characters U+0E48..U+0E4B, and Lao also has four tone marker characters U+0EC8..U+0ECB.

Some testing with Uniscribe and some reading of the HarfBuzz code have shown that this reordering is not limited to tone markers, but applies to all above-base marks.

(Note: Being unfamiliar with Thai/Lao, I am making the assumption that tone markers != abovebase marks.)
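To make the difference concrete, here is a minimal Python sketch of the Nikhahit reordering for Thai only. It is not HarfBuzz's or Uniscribe's actual code. It assumes Sara Am (U+0E33) decomposes to Nikhahit (U+0E4D) followed by Sara Aa (U+0E32), and it takes the set of "movable" marks as a parameter so the two readings can be compared. The wider set is an assumption for illustration (the Thai combining marks above: U+0E31, U+0E34..U+0E37, U+0E47..U+0E4E), and all function and constant names are hypothetical.

```python
SARA_AA = 0x0E32
SARA_AM = 0x0E33
NIKHAHIT = 0x0E4D

# Reading 1: only the four TONE_MARKER characters from the character tables.
TONE_MARKERS = frozenset(range(0x0E48, 0x0E4C))

# Reading 2 (assumed set): every Thai above-base combining mark.
ABOVE_BASE_MARKS = frozenset(
    [0x0E31, *range(0x0E34, 0x0E38), *range(0x0E47, 0x0E4F)]
)


def decompose_and_reorder(text, movable):
    """Split each Sara Am and move its Nikhahit before preceding `movable` marks."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp != SARA_AM:
            out.append(cp)
            continue
        # Walk back over the run of movable marks directly before Sara Am.
        i = len(out)
        while i > 0 and out[i - 1] in movable:
            i -= 1
        out.insert(i, NIKHAHIT)  # Nikhahit lands before that run
        out.append(SARA_AA)      # Sara Aa stays where Sara Am was
    return "".join(map(chr, out))


# Tone marker before Sara Am: both readings agree.
with_tone = "\u0E01\u0E49\u0E33"
# Non-tone above-base mark (U+0E47) before Sara Am: the readings differ.
with_other = "\u0E01\u0E47\u0E33"
```

With `with_tone`, both sets move the Nikhahit in front of U+0E49. With `with_other`, only the wider set moves it in front of U+0E47. Whether a sequence like that occurs in real text is a separate question; it is used here only to show where the two readings diverge.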

n8willis commented 3 years ago

Hmm; interesting. The MS docs do have a little classification list of their own for determining valid mark sequences as well, which is not quite the same as HarfBuzz's list.

You're definitely right about the "tone marker" != "above-base Mn" thing; there are dependent-vowel marks, killers, and some other marks I'm less sure of. I'll see what clarification I can find.

n8willis commented 3 years ago

So, the classes in the MS docs allegedly define a four-level ordering (upward from the base), with only one mark permitted from each level.

Unicode explicitly disagrees. From 16.1 (TUS 13, p 641):

For the purpose of rendering, the Thai combining marks above (U+0E31, U+0E34..U+0E37, U+0E47..U+0E4E) should be displayed outward from the base character they modify, in the order in which they appear in the text. In particular, a sequence containing <U+0E48 thai character mai ek, U+0E4D thai character nikhahit> should be displayed with the nikhahit above the mai ek, and a sequence containing <U+0E4D thai character nikhahit, U+0E48 thai character mai ek> should be displayed with the mai ek above the nikhahit.

On that specific example, the MS docs place U+0E4D in level 2 and U+0E48 in level 3, which would mean mai ek is always above nikhahit. For full dramatic irony, however, Unicode notes that "mai ek, nikhahit" is likely to be a typo in real-world text.
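The disagreement on that example can be sketched in a few lines of Python. This assumes a level model is applied as a stable sort by level, which is one interpretation and not something the MS docs spell out; only the two level assignments mentioned here are included, and the function names are hypothetical. Each result lists marks from the base outward.

```python
MAI_EK = "\u0E48"
NIKHAHIT = "\u0E4D"

# Only the two assignments cited in this thread; the full MS table is not reproduced.
MS_LEVEL = {NIKHAHIT: 2, MAI_EK: 3}


def order_by_level(marks):
    """Level model (assumed): lower levels sit closer to the base."""
    # sorted() is stable, so marks sharing a level keep their typed order.
    return sorted(marks, key=MS_LEVEL.__getitem__)


def order_as_typed(marks):
    """Unicode's rule: stack outward in the order the marks appear in the text."""
    return list(marks)


typed = [MAI_EK, NIKHAHIT]
by_level = order_by_level(typed)   # nikhahit innermost, mai ek on top
as_typed = order_as_typed(typed)   # mai ek innermost, nikhahit on top
```

For the input `<mai ek, nikhahit>`, the level sort puts mai ek on top while the typed order puts nikhahit on top. For `<nikhahit, mai ek>` the two agree.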

As per https://github.com/harfbuzz/harfbuzz/issues/1008 and a couple of other issues, HarfBuzz is taking the tactic that the docs do not seem to specify a full mark-reordering algorithm, so tracking compatibility with Uniscribe is the best available option.

For comparison, what HarfBuzz reorders vs the chartable vs the four MS levels is this:

code HB ct M1 M2 M3 M4 class
0080 y pad
0e31 y y TopDV
0e34 y y TopDV
0e35 y TopDV
0e36 y TopDV
0e37 y y TopDV
0e47 y y TopDV
0e48 y y tone
0e49 y y tone
0e4a y y tone
0e4b y y tone
0e4c y ConsK
0e4d y Bindu
0e4e y y PureK

HarfBuzz also treats the corresponding Lao codepoints in the same fashion. However, I did note that applying the Thai->Lao offset to those codepoints leaves one out (U+0EBB, 'Sign Mai Kon') and ropes in one undefined codepoint (U+0EC7), although that last bit hardly matters.
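A quick way to check the offset arithmetic is a small Python sketch. It assumes the offset is applied by masking out bit 0x0080 so that a Lao codepoint folds onto its Thai counterpart (suggested by the 0080 row in the table above, but not confirmed here), and it takes the Thai codepoints from that table. The names are hypothetical.

```python
THAI_TO_LAO_OFFSET = 0x0080

# Thai codepoints listed in the table above (every row except the 0080 row).
THAI_REORDERED = frozenset(
    [0x0E31, *range(0x0E34, 0x0E38), *range(0x0E47, 0x0E4F)]
)


def is_reordered(cp):
    """True if `cp`, with the Thai->Lao offset bit cleared, is in the Thai set."""
    return (cp & ~THAI_TO_LAO_OFFSET) in THAI_REORDERED


# Every codepoint in the Lao block (U+0E80..U+0EFF) that the mask picks up.
lao_covered = [cp for cp in range(0x0E80, 0x0F00) if is_reordered(cp)]

for cp in lao_covered:
    print(f"U+{cp:04X} <- U+{cp & ~THAI_TO_LAO_OFFSET:04X}")
```

Under that assumption the Lao side covers U+0EB1, U+0EB4..U+0EB7, and U+0EC7..U+0ECE. U+0EBB maps back to U+0E3B, which is not in the Thai set, so it is skipped, while U+0EC7 is included because it maps back to U+0E47.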

It'd be easy to make a case for following HarfBuzz; might also be easy to make a case for mentioning the four-level model from MS in the same spot, but if that model is actually valid for the written language I'd want to open an issue to discuss it within HarfBuzz.

Perhaps @mhosken could weigh in on whether adding more reordering as MS alludes to is worth it? From what I can tell, the visual-order approach of Thai makes the overhead of having the shaper do reordering less important.