Open adrianwong opened 3 years ago
Hmm; interesting. The MS docs do have a little classification list of their own for determining valid mark sequences as well, which is not quite the same as HarfBuzz's list.
You're definitely right about the "tone marker" != "above-base Mn" thing; there are dependent-vowel marks, killers, and some other marks I'm less sure of. I'll see what clarification I can find.
So, the classes in the MS docs allegedly define a four-level ordering (upward from the base), with only one mark permitted from each level.
Unicode explicitly disagrees. From 16.1 (TUS 13, p 641):
> For the purpose of rendering, the Thai combining marks above (U+0E31, U+0E34..U+0E37, U+0E47..U+0E4E) should be displayed outward from the base character they modify, in the order in which they appear in the text. In particular, a sequence containing <U+0E48 thai character mai ek, U+0E4D thai character nikhahit> should be displayed with the nikhahit above the mai ek, and a sequence containing <U+0E4D thai character nikhahit, U+0E48 thai character mai ek> should be displayed with the mai ek above the nikhahit.
On that specific example, the MS docs place U+0E4D in level 2 and U+0E48 in level 3, which would mean mai ek is always above nikhahit. For full dramatic irony, however, Unicode notes that "mai ek, nikhahit" is likely to be a typo in real-world text.
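To make the disagreement concrete, here is a minimal sketch of the two models. It only uses the two levels mentioned above (U+0E4D in level 2, U+0E48 in level 3); the function names and the `MS_LEVEL` table are made up for illustration and are not taken from the MS docs or from HarfBuzz.

```python
MAI_EK, NIKHAHIT = 0x0E48, 0x0E4D

# Only the two levels quoted in this thread; the rest of the MS
# classification is deliberately not filled in.
MS_LEVEL = {NIKHAHIT: 2, MAI_EK: 3}


def stack_in_text_order(marks):
    """Unicode's rule: marks stack outward from the base in text order."""
    return list(marks)


def stack_by_level(marks):
    """Hypothetical level model: a lower level sits closer to the base."""
    return sorted(marks, key=lambda cp: MS_LEVEL[cp])


def fmt(marks):
    return " < ".join(f"U+{cp:04X}" for cp in marks)


for seq in ([MAI_EK, NIKHAHIT], [NIKHAHIT, MAI_EK]):
    # Each result is listed bottom-to-top (closest to the base first).
    print("input:", fmt(seq))
    print("  text order :", fmt(stack_in_text_order(seq)))
    print("  level model:", fmt(stack_by_level(seq)))
```

Under the level model both inputs collapse to the same stack (nikhahit below mai ek), whereas the text-order rule keeps the two sequences visually distinct.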
As per https://github.com/harfbuzz/harfbuzz/issues/1008 and a couple of other issues, HarfBuzz is taking the tactic that the docs do not seem to specify a full mark-reordering algorithm, so tracking compatibility with Uniscribe is the best available option.
For comparison, here is what HarfBuzz reorders (HB) vs the chartable (ct) vs the four MS levels (M1..M4):
code | HB | ct | M1 | M2 | M3 | M4 | class |
---|---|---|---|---|---|---|---|
0080 | y | pad | |||||
0e31 | y | y | TopDV | ||||
0e34 | y | y | TopDV | ||||
0e35 | y | TopDV | |||||
0e36 | y | TopDV | |||||
0e37 | y | y | TopDV | ||||
0e47 | y | y | TopDV | ||||
0e48 | y | y | tone | ||||
0e49 | y | y | tone | ||||
0e4a | y | y | tone | ||||
0e4b | y | y | tone | ||||
0e4c | y | ConsK | |||||
0e4d | y | Bindu | |||||
0e4e | y | y | PureK |
HarfBuzz also treats the corresponding Lao codepoints in the same fashion. However, I did note that applying the Thai->Lao offset to those codepoints leaves one out (U+0EBB, 'Sign Mai Kon') and ropes in one undefined codepoint (U+0EC7), although that last bit hardly matters.
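The offset observation can be checked with a few lines of arithmetic. This is a sketch only: the range list is copied from the table above, and `is_reordered_mark` is a hypothetical helper showing one way to share a single range check between Thai and Lao, not HarfBuzz's actual code.

```python
THAI_TO_LAO = 0x80  # offset from the Thai block (U+0E00) to the Lao block (U+0E80)

# Thai above-base marks, as listed in the table above.
THAI_ABOVE = frozenset([0x0E31, *range(0x0E34, 0x0E38), *range(0x0E47, 0x0E4F)])

# The same set shifted into the Lao block.
LAO_MAPPED = frozenset(cp + THAI_TO_LAO for cp in THAI_ABOVE)


def is_reordered_mark(cp):
    """Hypothetical check: clear the offset bit, then test the Thai set."""
    return (cp & ~THAI_TO_LAO) in THAI_ABOVE


print("U+0EBB covered:", 0x0EBB in LAO_MAPPED)  # Mai Kon; maps back to U+0E3B
print("U+0EC7 covered:", 0x0EC7 in LAO_MAPPED)  # maps back to U+0E47
print("Lao tone marks covered:",
      all(is_reordered_mark(cp) for cp in range(0x0EC8, 0x0ECC)))
```

Covering U+0EBB would need an explicit extra entry, since its Thai counterpart under the offset (U+0E3B) is not in the list.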
It'd be easy to make a case for following HarfBuzz; might also be easy to make a case for mentioning the four-level model from MS in the same spot, but if that model is actually valid for the written language I'd want to open an issue to discuss it within HarfBuzz.
Perhaps @mhosken could weigh in on whether adding more reordering as MS alludes to is worth it? From what I can tell, the visual-order approach of Thai makes the overhead of having the shaper do reordering less important.
Our spec states the following (emphasis mine):
According to our character tables, Thai has four `TONE_MARKER` characters, `U+0E48..U+0E4B`, and Lao also has four tone marker characters, `U+0EC8..U+0ECB`.
Some testing with Uniscribe, and some reading of HarfBuzz code has shown that this reordering is not just limited to tone markers, but rather, all abovebase marks.
(Note: Being unfamiliar with Thai/Lao, I am making the assumption that tone markers != abovebase marks.)