n8willis / opentype-shaping-documents

Documentation of OpenType shaping behavior

Adjacent marks #34

Closed mikeday closed 5 years ago

mikeday commented 6 years ago

https://github.com/n8willis/opentype-shaping-documents/blob/master/opentype-shaping-indic-general.md#24-adjacent-marks

"Fourth, any subsequences of adjacent marks ("Halant"s, "Nukta"s, syllable modifiers, and Vedic signs) must be reordered so that they appear in canonical order."

that is to say, Unicode canonical order?

n8willis commented 6 years ago

I believe so; I think it's the Unicode 'canonical combining class' sorting. It's just that in the Indic scripts, there aren't many marks that need examining.

I looked through the UCD data, and Halants are class "9", Nuktas are "7". Essentially nothing else is in a reorderable class, except for Vedic signs, which would be in the tail, not mixed in with Nuktas and Halants.
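As a quick way to check those class values, here is a minimal sketch using Python's standard `unicodedata` module. The choice of Devanagari code points (U+093C, U+094D, U+0902) and the `ccc` helper name are just for illustration and are not from the shaping documents; the values reported depend on the Unicode version bundled with the Python build.

```python
import unicodedata

# Illustrative Devanagari code points (chosen for this sketch).
MARKS = {
    "Nukta": "\u093C",     # DEVANAGARI SIGN NUKTA
    "Halant": "\u094D",    # DEVANAGARI SIGN VIRAMA
    "Anusvara": "\u0902",  # DEVANAGARI SIGN ANUSVARA (a syllable modifier)
}

def ccc(ch):
    """Return the canonical combining class of a single character."""
    return unicodedata.combining(ch)

for name, ch in MARKS.items():
    print(f"U+{ord(ch):04X} {name}: ccc={ccc(ch)}")
```

A class of 0 means the character is not moved by canonical reordering, which is why only the Nukta and Halant matter in this sample.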

In HarfBuzz, the issue seems to be just ensuring that Halants don't get placed next to the wrong base. Its source also notes, however, that the Nuktas (in HarfBuzz's code path, that is) are handled by the Unicode normalization step, so nothing needs to be done here.

I need to take a second read through the source there; the wording in the inline comments is a little odd.

n8willis commented 6 years ago

So it turns out that this comes directly from the Microsoft script documents (see https://docs.microsoft.com/en-us/typography/script-development/devanagari ). The wording there is, I think, somewhat ambiguous; it says

"Adjacent nukta and halant or nukta and vedic sign are always repositioned if necessary, so that the nukta is first."

Strict interpretation of that would mean that it looks for "Halant,Nukta" and "vedicsign,Nukta" sequences, nothing else.

There are some marks in the Vedic Extensions block that are lower in CCC than "Nukta", such as U+1CE2 .... Those are ccc=1, so would be sorted before the Nukta. But they are all also "overstrike" marks and thus wouldn't visually look different.

So it's possible that HarfBuzz just doesn't mess with the overstrikes for simplicity. But I'm not sure whether just testing for the two Nukta-sequences literally mentioned by Microsoft is fine, or if the whole sequence of adjacent marks technically ought to get sorted -- if the shaping engine is going to do that here.
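To make the two readings concrete, here is a rough Python sketch of both. Neither function is taken from HarfBuzz or from the Microsoft documents; `canonical_reorder`, `nukta_first`, and the `vedic_signs` parameter are hypothetical names for this illustration. The first does a stable sort of each run of marks with non-zero ccc, and the second only swaps the literal "Halant,Nukta" and "vedicsign,Nukta" pairs in a single pass, which is a simplification.

```python
import unicodedata

KA = "\u0915"          # DEVANAGARI LETTER KA (a base, ccc=0)
NUKTA = "\u093C"       # DEVANAGARI SIGN NUKTA (ccc=7)
HALANT = "\u094D"      # DEVANAGARI SIGN VIRAMA (ccc=9)
OVERSTRIKE = "\u1CE2"  # a Vedic Extensions overstrike mark (ccc=1)

def canonical_reorder(text):
    """Sort every run of marks with non-zero ccc by ccc (stable sort)."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) != 0:
            run.append(ch)
        else:
            # A ccc=0 character ends the current run of reorderable marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def nukta_first(text, vedic_signs=frozenset()):
    """Strict reading: swap only 'Halant,Nukta' and 'vedicsign,Nukta' pairs."""
    chars = list(text)
    for i in range(len(chars) - 1):
        follows_target = chars[i] == HALANT or chars[i] in vedic_signs
        if chars[i + 1] == NUKTA and follows_target:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def codepoints(text):
    """Format a string as a list of U+XXXX labels for easy comparison."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(codepoints(canonical_reorder(KA + NUKTA + OVERSTRIKE)))
print(codepoints(nukta_first(KA + NUKTA + OVERSTRIKE, {OVERSTRIKE})))
```

For "Halant,Nukta" both functions give the same result. They diverge on a Nukta followed by a ccc=1 overstrike: the full sort moves the overstrike ahead of the Nukta, while the strict reading leaves the sequence alone.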

@behdad Do you have any advice on whether this means that just "Halant,Nukta" and "vedicsign,Nukta" should get looked at, or that any and all sequences of marks should get reordered?

n8willis commented 5 years ago

Hopefully clarified in b3d96ac.