n8willis / opentype-shaping-documents

Documentation of OpenType shaping behavior
168 stars 13 forks source link

[Indic] Final Reph reordering #48

Open adrianwong opened 5 years ago

adrianwong commented 5 years ago

There are a few things that have come up from our implementation of the final Reph reordering that I thought I'd capture here:

b. If the reph repositioning class is not after post-base: target position is after the first explicit halant glyph between the first post-reph consonant and last main consonant. If ZWJ or ZWNJ are following this halant, position is moved after it. If such position is found, this is the target position. Otherwise, proceed to the next step. Note: in old-implementation fonts, where classifications were fixed in shaping engine, there was no case where reph position will be found on this step.

*** What we do is synonymous with OpenType's final Reph reordering step 4, but HarfBuzz and CoreText skip straight to step 5. This makes sense to me, as the second sentence in step 4:

If no consonant is found, the target position should be before the first matra, syllable modifier sign or vedic sign.

is redundant when followed by step 5 anyway.

n8willis commented 4 years ago

I've been re-examining this, and I cannot see a difference between REPH_POS_AFTER_SUBJOINED and REPH_POS_BEFORE_POST, according to the Microsoft docs, other than the fact that at step (e) you have a different value for "reordering class after the intended reph position".

Compare #84 in the Devanagari and Bengali documents, step 4.3.

HarfBuzz has these two Reph positions as distinct elements in the enum, but I don't see that they're treated differently.

adrianwong commented 4 years ago

I've been re-examining this, and I cannot see a difference between REPH_POS_AFTER_SUBJOINED and REPH_POS_BEFORE_POST, according to the Microsoft docs, other than the fact that at step (e) you have a different value for "reordering class after the intended reph position".

That seems to be the case for Bengali and Devanagari. However, Gujarati (which is REPH_POS_BEFORE_POST) appears to behave differently. Take the examples:

With REPH_POS_BEFORE_POST scripts, HarfBuzz first checks for explicit Halants then just moves the Reph to the end of the syllable. This explains the Gujarati behaviour observed above.

n8willis commented 4 years ago

I was going to expound on my take on the wording of MS step (b) ("If the reph repositioning class is not after post-base: target position is after the first explicit halant glyph between the first post-reph consonant and last main consonant."), but in lieu of doing that and getting confused, I decided to take a swipe at charting it:

Reph Position a b c d e mH
REPH_POS_BEFORE_POST X X X X
REPH_POS_AFTER_SUBJOINED X X
REPH_POS_BEFORE_SUBJOINED X X
REPH_POS_AFTER_POST X X
REPH_POS_AFTER_MAIN X X X X

Those are the steps (a being just a conditional, not an action, of course), with "mH" representing the catch-all "if you stop after a matra,Halant, reorder to the left of Halant", which is not in the MS sequence list.

Does that mesh with your interpretation?

n8willis commented 4 years ago

(note, though, that I'm not saying that the chart above reflects my earlier comment about _AFTER_SUBJOINED and _BEFORE_POST -- instead it's for taking a hopefully cleaner approach from scratch.)

n8willis commented 4 years ago

Here it is with all the scripts' names:

Script Reph Position a b c d e mH
Devanagari REPH_POS_BEFORE_POST X X X X
Bengali REPH_POS_AFTER_SUBJOINED X X
Gujarati REPH_POS_BEFORE_POST X X X X
Gurmukhi REPH_POS_BEFORE_SUBJOINED X X
Kannada REPH_POS_AFTER_POST X X
Malayalam REPH_POS_AFTER_MAIN X X X X
Oriya REPH_POS_AFTER_MAIN X X X X
Tamil REPH_POS_AFTER_POST X X
Telugu REPH_POS_AFTER_POST X X
Sinhala REPH_POS_AFTER_MAIN X X X X

...and sorted into the actual position order:

Script Reph Position a b c d e mH
Malayalam REPH_POS_AFTER_MAIN X X X X
Oriya REPH_POS_AFTER_MAIN X X X X
Sinhala REPH_POS_AFTER_MAIN X X X X
Gurmukhi REPH_POS_BEFORE_SUBJOINED X X
Bengali REPH_POS_AFTER_SUBJOINED X X
Devanagari REPH_POS_BEFORE_POST X X X X
Gujarati REPH_POS_BEFORE_POST X X X X
Kannada REPH_POS_AFTER_POST X X
Tamil REPH_POS_AFTER_POST X X
Telugu REPH_POS_AFTER_POST X X
n8willis commented 4 years ago

@adrianwong When you get a chance, perhaps you could take a look at these and note any inconsistencies?

adrianwong commented 4 years ago

Apologies for not responding sooner, @n8willis!

Reph Position a b c d e mH
REPH_POS_BEFORE_POST X X X X
REPH_POS_AFTER_SUBJOINED X X
REPH_POS_BEFORE_SUBJOINED X X
REPH_POS_AFTER_POST X X
REPH_POS_AFTER_MAIN X X X X

It looks like we've missed a column "f. Otherwise, reorder reph to the end of the syllable", which should apply to all positioning classes. That aside, the table meshes with my interpretation. So:

Reph Position a b c d e f mH
REPH_POS_BEFORE_POST X X X X X
REPH_POS_AFTER_SUBJOINED X X X
REPH_POS_BEFORE_SUBJOINED X X X
REPH_POS_AFTER_POST X X X
REPH_POS_AFTER_MAIN X X X X X

This is my attempt at tabling HarfBuzz's approach:

Reph Position a b c d e f mH
REPH_POS_BEFORE_POST X X2 X X
REPH_POS_AFTER_SUBJOINED X X1 X2 X X
REPH_POS_BEFORE_SUBJOINED X X2 X X
REPH_POS_AFTER_POST X2 X X
REPH_POS_AFTER_MAIN X X X2 X X

X1 : HarfBuzz applies step (d) to REPH_POS_AFTER_SUBJOINED instead. X2 : HarfBuzz's step (e) is a repetition of step (b).

n8willis commented 4 years ago

Well, my re-reading and re-dissecting these steps has me thinking that:

So, if you know that the script does not utilize ZWJ/ZWNJ (or shouldn't), then collapsing b & e into the same function would not mess anything up.

Similarly, if you know that the script doesn't incorporate pre-base-reordering Ra, then collapsing c into d would be a non issue as well. But that doesn't explain what HarfBuzz does.

However, there's no gap between POS_AFTER_SUBJOINED and POS_BEFORE_POST ... I'm not convinced you could end up with a different Reph position using one or the other.

On X1, it looks to me like HarfBuzz does not follow the (d) with (e).

adrianwong commented 4 years ago

the difference between (b) and (e) is that (b) checks for ZWJ/ZWNJ, which (e) does not do

Could you expound on this? I'm struggling to see how (b) and (e) are alike (sans joiner checks).

On X1, it looks to me like HarfBuzz does not follow the (d) with (e).

Doesn't it? (d) falls through to (e) if the check in line 1302 fails.

n8willis commented 4 years ago

the difference between (b) and (e) is that (b) checks for ZWJ/ZWNJ, which (e) does not do

Could you expound on this? I'm struggling to see how (b) and (e) are alike (sans joiner checks).

Oh. Ahh; I think I was looking at (d) instead of at (e).... Too much flipping back-and-forth between numbered lists and lettered lists.... One of those days.

On X1, it looks to me like HarfBuzz does not follow the (d) with (e).

Doesn't it? (d) falls through to (e) if the check in line 1302 fails.

I'm certainly easily confused, so I defer to your tracing through here. But it looked to me like new_reph_pos would always be less than end.

That said, I still don't grasp why HB repeats step 2 as its step 5. It happened in a commit indicating it was fixing Kannada failures; if (5) wasn't done, then Kannada would jump right to the "move to end" option,

n8willis commented 4 years ago

I'm inclined to close this after merging #84, since it deals with the initial scope of the issue. We do still have the problem of HarfBuzz's apparent divergence from the MS docs, however.

I'd prefer to raise that question within the HarfBuzz org, see what happens, and then either (a) open a new issue to bring these docs inline with the clarified situation from HB or (b) pause further changes to 4.3 while HarfBuzz re-examines that bit of code ... whichever of those outcomes happens.

But, @adrianwong if you don't feel like the issue is resolved yet, by all means feel free to keep it open.

adrianwong commented 4 years ago

I'm certainly easily confused, so I defer to your tracing through here. But it looked to me like new_reph_pos would always be less than end.

Ah! You may very well be right - I only based my analysis off the existence of the if, which implies the possibility of new_reph_pos >= end. This stuff is tricky, to say the least.

That said, I still don't grasp why HB repeats step 2 as its step 5. It happened in a commit indicating it was fixing Kannada failures; if (5) wasn't done, then Kannada would jump right to the "move to end" option.

I've come across that commit too. I don't grasp why it had to be done either. Perhaps I should spend some time re-analysing this algorithm within the context of Kannada shaping.

I'd prefer to raise that question within the HarfBuzz org, see what happens, and then either (a) open a new issue to bring these docs inline with the clarified situation from HB or (b) pause further changes to 4.3 while HarfBuzz re-examines that bit of code ... whichever of those outcomes happens.

Would it be okay with you if we kept this open for now, raise the question with HarfBuzz, open a new issue here, then close this one? (Out of sight, out of mind and all that.)

n8willis commented 4 years ago

Would it be okay with you if we kept this open for now, raise the question with HarfBuzz, open a new issue here, then close this one? (Out of sight, out of mind and all that.)

Yeah; sounds good.

n8willis commented 4 years ago

So the word from HarfBuzz is two-part:

1) The HarfBuzz code actually does do the same thing as step e and step f, it's just divided up slightly differently between the two.

2) The inline comments in HarfBuzz are confusing on that point, largely because they quote the steps from the steps in the MS docs without attempting to split up those quotes to exactly match the HarfBuzz division (an understandable choice, since dicing up the quotes might also cause its own brand of confusion), but also somewhat because comments like the "copied from b" part seem to apply to a larger block of code than they actually do.

Since I do some HarfBuzz doc wrangling, I might could clean up the comment stuff for future reference.

n8willis commented 4 years ago

@adrianwong Any more to be done on this issue, do you think?

adrianwong commented 4 years ago

Apologies for the delay here. Give me a week or two to re-familiarise myself with this; it's clear from Behdad's response on the other thread that there are some things that I've missed.

n8willis commented 2 years ago

Super-friendly and non-threatening ping!

adrianwong commented 2 years ago

non-threatening ping

If someone says that they'll get back to you in "a week or two" but hasn't responded in 1.5 years, I think it's well within your right to make this ping a threatening one.

Re-reading this and the associated HarfBuzz thread, I think we were trying to address three issues:

  1. Applying step (d) to REPH_POS_AFTER_SUBJOINED instead of REPH_POS_BEFORE_POST (see this old post). Judging by the comment on this line, my guess is that Uniscribe may not be doing what the spec says it should do either. I'm not entirely sure how we should resolve this.
  2. Repeating step (b) at step (e). From what I gather, we're still not sure why this was done beyond it fixing some Kannada failures.
  3. What's happened to step (e) then? As you and Behdad have discussed, this step was just pushed into step (f).
n8willis commented 2 years ago

non-threatening ping

If someone says that they'll get back to you in "a week or two" but hasn't responded in 1.5 years, I think it's well within your right to make this ping a threatening one.

Couldn't risk it, karmically. I'm farther behind than that on waaaay too many things....

1. **Applying step (d) to `REPH_POS_AFTER_SUBJOINED` instead of `REPH_POS_BEFORE_POST` (see [this](https://github.com/n8willis/opentype-shaping-documents/issues/48#issuecomment-558481098) old post).** Judging by the comment on [this line](https://github.com/harfbuzz/harfbuzz/blob/78f3d7f0a8dc399415dbd6983212997fdf9831b1/src/hb-ot-shape-complex-indic.cc#L1257), my guess is that Uniscribe may not be doing what the spec says it should do either. I'm not entirely sure how we should resolve this.

I'll take a quick look at this either tonight or tomorrow, schedule permitting. There's always the possibility of filing an issue about it on Microsoft's newfangled GitHub repo, which didn't used to be an option, and putting it into the Uniscribe-compatibilty category.

2. **Repeating step (b) at step (e).** From what I gather, we're still not sure why this was done beyond it [fixing some Kannada failures](https://github.com/harfbuzz/harfbuzz/issues/2298#issuecomment-614814994).

Yeah, and fortunately Behdad suggested in that thread that he was definitely open to improving it. If there's a specific test case or two that would probably be ideal; it appears from that thread that the issue just caught him during a week when he didn't have all the test apparatus set up.

Was the example in comment 3 where it actually came up?

adrianwong commented 2 years ago

Was the example in https://github.com/n8willis/opentype-shaping-documents/issues/48#issuecomment-536838375 where it actually came up?

Gosh. My memory is very hazy at this point, but it appears to be somewhat related to both issues (1) and (2).

I think what I was trying to illustrate with the Gujarati example (which has REPH_POS_BEFORE_POST behaviour), was that its behaviour in Uniscribe (and HarfBuzz) was inconsistent with steps (d) and (e) of the MS spec.