Open adrianwong opened 5 years ago
I've been re-examining this, and I cannot see a difference between REPH_POS_AFTER_SUBJOINED and REPH_POS_BEFORE_POST, according to the Microsoft docs, other than the fact that at step (e) you have a different value for "reordering class after the intended reph position".
Compare #84 in the Devanagari and Bengali documents, step 4.3.
HarfBuzz has these two Reph positions as distinct elements in the enum, but I don't see that they're treated differently.
I've been re-examining this, and I cannot see a difference between REPH_POS_AFTER_SUBJOINED and REPH_POS_BEFORE_POST, according to the Microsoft docs, other than the fact that at step (e) you have a different value for "reordering class after the intended reph position".
That seems to be the case for Bengali and Devanagari. However, Gujarati (which is REPH_POS_BEFORE_POST
) appears to behave differently. Take the examples:
POS_AFTER_SUBJOINED
position tag. This is consistent with your findings.POS_AFTER_POST
position tag. HarfBuzz and Uniscribe both do this.With REPH_POS_BEFORE_POST
scripts, HarfBuzz first checks for explicit Halants then just moves the Reph to the end of the syllable. This explains the Gujarati behaviour observed above.
I was going to expound on my take on the wording of MS step (b) ("If the reph repositioning class is not after post-base: target position is after the first explicit halant glyph between the first post-reph consonant and last main consonant."), but in lieu of doing that and getting confused, I decided to take a swipe at charting it:
Reph Position | a | b | c | d | e | mH |
---|---|---|---|---|---|---|
REPH_POS_BEFORE_POST |
X | X | X | X | ||
REPH_POS_AFTER_SUBJOINED |
X | X | ||||
REPH_POS_BEFORE_SUBJOINED |
X | X | ||||
REPH_POS_AFTER_POST |
X | X | ||||
REPH_POS_AFTER_MAIN |
X | X | X | X |
Those are the steps (a being just a conditional, not an action, of course), with "mH" representing the catch-all "if you stop after a matra,Halant, reorder to the left of Halant", which is not in the MS sequence list.
Does that mesh with your interpretation?
(note, though, that I'm not saying that the chart above reflects my earlier comment about _AFTER_SUBJOINED and _BEFORE_POST -- instead it's for taking a hopefully cleaner approach from scratch.)
Here it is with all the scripts' names:
Script | Reph Position | a | b | c | d | e | mH |
---|---|---|---|---|---|---|---|
Devanagari | REPH_POS_BEFORE_POST |
X | X | X | X | ||
Bengali | REPH_POS_AFTER_SUBJOINED |
X | X | ||||
Gujarati | REPH_POS_BEFORE_POST |
X | X | X | X | ||
Gurmukhi | REPH_POS_BEFORE_SUBJOINED |
X | X | ||||
Kannada | REPH_POS_AFTER_POST |
X | X | ||||
Malayalam | REPH_POS_AFTER_MAIN |
X | X | X | X | ||
Oriya | REPH_POS_AFTER_MAIN |
X | X | X | X | ||
Tamil | REPH_POS_AFTER_POST |
X | X | ||||
Telugu | REPH_POS_AFTER_POST |
X | X | ||||
Sinhala | REPH_POS_AFTER_MAIN |
X | X | X | X |
...and sorted into the actual position order:
Script | Reph Position | a | b | c | d | e | mH |
---|---|---|---|---|---|---|---|
Malayalam | REPH_POS_AFTER_MAIN |
X | X | X | X | ||
Oriya | REPH_POS_AFTER_MAIN |
X | X | X | X | ||
Sinhala | REPH_POS_AFTER_MAIN |
X | X | X | X | ||
Gurmukhi | REPH_POS_BEFORE_SUBJOINED |
X | X | ||||
Bengali | REPH_POS_AFTER_SUBJOINED |
X | X | ||||
Devanagari | REPH_POS_BEFORE_POST |
X | X | X | X | ||
Gujarati | REPH_POS_BEFORE_POST |
X | X | X | X | ||
Kannada | REPH_POS_AFTER_POST |
X | X | ||||
Tamil | REPH_POS_AFTER_POST |
X | X | ||||
Telugu | REPH_POS_AFTER_POST |
X | X |
@adrianwong When you get a chance, perhaps you could take a look at these and note any inconsistencies?
Apologies for not responding sooner, @n8willis!
Reph Position a b c d e mH REPH_POS_BEFORE_POST
X X X X REPH_POS_AFTER_SUBJOINED
X X REPH_POS_BEFORE_SUBJOINED
X X REPH_POS_AFTER_POST
X X REPH_POS_AFTER_MAIN
X X X X
It looks like we've missed a column "f. Otherwise, reorder reph to the end of the syllable", which should apply to all positioning classes. That aside, the table meshes with my interpretation. So:
Reph Position | a | b | c | d | e | f | mH |
---|---|---|---|---|---|---|---|
REPH_POS_BEFORE_POST |
X | X | X | X | X | ||
REPH_POS_AFTER_SUBJOINED |
X | X | X | ||||
REPH_POS_BEFORE_SUBJOINED |
X | X | X | ||||
REPH_POS_AFTER_POST |
X | X | X | ||||
REPH_POS_AFTER_MAIN |
X | X | X | X | X |
This is my attempt at tabling HarfBuzz's approach:
Reph Position | a | b | c | d | e | f | mH |
---|---|---|---|---|---|---|---|
REPH_POS_BEFORE_POST |
X | X2 | X | X | |||
REPH_POS_AFTER_SUBJOINED |
X | X1 | X2 | X | X | ||
REPH_POS_BEFORE_SUBJOINED |
X | X2 | X | X | |||
REPH_POS_AFTER_POST |
X2 | X | X | ||||
REPH_POS_AFTER_MAIN |
X | X | X2 | X | X |
X1 : HarfBuzz applies step (d) to REPH_POS_AFTER_SUBJOINED
instead.
X2 : HarfBuzz's step (e) is a repetition of step (b).
Well, my re-reading and re-dissecting these steps has me thinking that:
So, if you know that the script does not utilize ZWJ/ZWNJ (or shouldn't), then collapsing b & e into the same function would not mess anything up.
Similarly, if you know that the script doesn't incorporate pre-base-reordering Ra, then collapsing c into d would be a non issue as well. But that doesn't explain what HarfBuzz does.
However, there's no gap between POS_AFTER_SUBJOINED and POS_BEFORE_POST ... I'm not convinced you could end up with a different Reph position using one or the other.
On X1, it looks to me like HarfBuzz does not follow the (d) with (e).
the difference between (b) and (e) is that (b) checks for ZWJ/ZWNJ, which (e) does not do
Could you expound on this? I'm struggling to see how (b) and (e) are alike (sans joiner checks).
On X1, it looks to me like HarfBuzz does not follow the (d) with (e).
Doesn't it? (d) falls through to (e) if the check in line 1302 fails.
the difference between (b) and (e) is that (b) checks for ZWJ/ZWNJ, which (e) does not do
Could you expound on this? I'm struggling to see how (b) and (e) are alike (sans joiner checks).
Oh. Ahh; I think I was looking at (d) instead of at (e).... Too much flipping back-and-forth between numbered lists and lettered lists.... One of those days.
On X1, it looks to me like HarfBuzz does not follow the (d) with (e).
Doesn't it? (d) falls through to (e) if the check in line 1302 fails.
I'm certainly easily confused, so I defer to your tracing through here. But it looked to me like new_reph_pos
would always be less than end
.
That said, I still don't grasp why HB repeats step 2 as its step 5. It happened in a commit indicating it was fixing Kannada failures; if (5) wasn't done, then Kannada would jump right to the "move to end" option,
I'm inclined to close this after merging #84, since it deals with the initial scope of the issue. We do still have the problem of HarfBuzz's apparent divergence from the MS docs, however.
I'd prefer to raise that question within the HarfBuzz org, see what happens, and then either (a) open a new issue to bring these docs inline with the clarified situation from HB or (b) pause further changes to 4.3 while HarfBuzz re-examines that bit of code ... whichever of those outcomes happens.
But, @adrianwong if you don't feel like the issue is resolved yet, by all means feel free to keep it open.
I'm certainly easily confused, so I defer to your tracing through here. But it looked to me like new_reph_pos would always be less than end.
Ah! You may very well be right - I only based my analysis off the existence of the if
, which implies the possibility of new_reph_pos >= end
. This stuff is tricky, to say the least.
That said, I still don't grasp why HB repeats step 2 as its step 5. It happened in a commit indicating it was fixing Kannada failures; if (5) wasn't done, then Kannada would jump right to the "move to end" option.
I've come across that commit too. I don't grasp why it had to be done either. Perhaps I should spend some time re-analysing this algorithm within the context of Kannada shaping.
I'd prefer to raise that question within the HarfBuzz org, see what happens, and then either (a) open a new issue to bring these docs inline with the clarified situation from HB or (b) pause further changes to 4.3 while HarfBuzz re-examines that bit of code ... whichever of those outcomes happens.
Would it be okay with you if we kept this open for now, raise the question with HarfBuzz, open a new issue here, then close this one? (Out of sight, out of mind and all that.)
Would it be okay with you if we kept this open for now, raise the question with HarfBuzz, open a new issue here, then close this one? (Out of sight, out of mind and all that.)
Yeah; sounds good.
So the word from HarfBuzz is two-part:
1) The HarfBuzz code actually does do the same thing as step e and step f, it's just divided up slightly differently between the two.
2) The inline comments in HarfBuzz are confusing on that point, largely because they quote the steps from the steps in the MS docs without attempting to split up those quotes to exactly match the HarfBuzz division (an understandable choice, since dicing up the quotes might also cause its own brand of confusion), but also somewhat because comments like the "copied from b" part seem to apply to a larger block of code than they actually do.
Since I do some HarfBuzz doc wrangling, I might could clean up the comment stuff for future reference.
@adrianwong Any more to be done on this issue, do you think?
Apologies for the delay here. Give me a week or two to re-familiarise myself with this; it's clear from Behdad's response on the other thread that there are some things that I've missed.
Super-friendly and non-threatening ping!
non-threatening ping
If someone says that they'll get back to you in "a week or two" but hasn't responded in 1.5 years, I think it's well within your right to make this ping a threatening one.
Re-reading this and the associated HarfBuzz thread, I think we were trying to address three issues:
REPH_POS_AFTER_SUBJOINED
instead of REPH_POS_BEFORE_POST
(see this old post). Judging by the comment on this line, my guess is that Uniscribe may not be doing what the spec says it should do either. I'm not entirely sure how we should resolve this.non-threatening ping
If someone says that they'll get back to you in "a week or two" but hasn't responded in 1.5 years, I think it's well within your right to make this ping a threatening one.
Couldn't risk it, karmically. I'm farther behind than that on waaaay too many things....
1. **Applying step (d) to `REPH_POS_AFTER_SUBJOINED` instead of `REPH_POS_BEFORE_POST` (see [this](https://github.com/n8willis/opentype-shaping-documents/issues/48#issuecomment-558481098) old post).** Judging by the comment on [this line](https://github.com/harfbuzz/harfbuzz/blob/78f3d7f0a8dc399415dbd6983212997fdf9831b1/src/hb-ot-shape-complex-indic.cc#L1257), my guess is that Uniscribe may not be doing what the spec says it should do either. I'm not entirely sure how we should resolve this.
I'll take a quick look at this either tonight or tomorrow, schedule permitting. There's always the possibility of filing an issue about it on Microsoft's newfangled GitHub repo, which didn't used to be an option, and putting it into the Uniscribe-compatibilty category.
2. **Repeating step (b) at step (e).** From what I gather, we're still not sure why this was done beyond it [fixing some Kannada failures](https://github.com/harfbuzz/harfbuzz/issues/2298#issuecomment-614814994).
Yeah, and fortunately Behdad suggested in that thread that he was definitely open to improving it. If there's a specific test case or two that would probably be ideal; it appears from that thread that the issue just caught him during a week when he didn't have all the test apparatus set up.
Was the example in comment 3 where it actually came up?
Was the example in https://github.com/n8willis/opentype-shaping-documents/issues/48#issuecomment-536838375 where it actually came up?
Gosh. My memory is very hazy at this point, but it appears to be somewhat related to both issues (1) and (2).
I think what I was trying to illustrate with the Gujarati example (which has REPH_POS_BEFORE_POST
behaviour), was that its behaviour in Uniscribe (and HarfBuzz) was inconsistent with steps (d) and (e) of the MS spec.
There are a few things that have come up from our implementation of the final Reph reordering that I thought I'd capture here:
REPH_POS_BEFORE_POST
characteristic:*** What we do is synonymous with OpenType's final Reph reordering step 4, but HarfBuzz and CoreText skip straight to step 5. This makes sense to me, as the second sentence in step 4:
is redundant when followed by step 5 anyway.