adrianwong closed this issue 3 years ago
So, my reading of the Devanagari chapter and the "special areas" chapter in Unicode 12 leads me to think that the HarfBuzz approach is right, on the grounds that ZWJ/ZWNJ are "control characters" which are intended to scope only to the immediately preceding and immediately following codepoints. Or, perhaps, to scope only to the slot between them.
That intention sounds like what Zachary's screenshot (last para specifically) refers to -- "zero width joiner and zero width non-joiner are format control characters. As such, and in common with other format control characters, they are ordinarily ignored by processes that analyze text content. For example, a spell-checker or a search operation should filter them out when checking for matches."
Like, the ZWJ/NJ are designed to function as a scalpel, not to have ripple effects further and further out.
If a ZWJ in "A,ZWJ,B,C" affected not just the ligation of "AB" but also that of "BC", then someone could make a similar argument that it might also affect the cursive connection between "B" and "C", and that result seems almost certainly wrong.
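To make the "scalpel" reading concrete, here is a minimal sketch of it in Python. The function name `joiner_slots` is made up for illustration and is not taken from any shaping engine; it only encodes the assumption that a joiner scopes to the slot between its immediate neighbours and nothing further out.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
JOINERS = {ZWJ, ZWNJ}

def joiner_slots(text):
    """List (left, joiner, right) for every joiner in text.

    Hypothetical helper: it models the reading that a joiner only
    concerns the slot between its immediate neighbours.
    """
    slots = []
    for i, ch in enumerate(text):
        if ch in JOINERS:
            left = text[i - 1] if i > 0 else None
            right = text[i + 1] if i + 1 < len(text) else None
            slots.append((left, ch, right))
    return slots

# "A,ZWJ,B,C": only the A/B slot is touched, never B/C.
slots = joiner_slots("A" + ZWJ + "BC")
```

Under this reading, the "BC" pair never shows up in the result, which is the whole point of the argument above.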
I mean, perhaps it all just boils down to whether or not backtrack & lookahead could be considered "analyzing text content." It seems defensible to just say "yes" flat out; syllable identification is also pretty "analyze-text-content"y, and the regexes make sure that ZWJ/ZWNJ don't crash that.
Okay; LAST proof-texting on this, I promise.... There's also this note in Ch. 23.2: "Moreover, they are essentially requests for the rendering system to take into account when laying out the text; while a rendering system should consider them, it is perfectly acceptable for the system to disregard these requests."
Which seems to be saying that, at the very least, disregarding them doesn't cross the line from "correct" into "incorrect."
Thanks for the deep dive, Nathan. Should we add this behaviour to the errata document?
Yeah. I'll also note the odd "f,ZWJ,i" ligature-lookup suggestion that Behdad pointed out as an erratum (assuming that's a word). It's still around in Unicode 12.1, and it's still certainly not industry standard and probably outright wrong.
I think it's also worth making a paragraph about the ZW[N]Js a bit more prominent in the Indic docs, up near the beginning, so that they don't take readers by surprise later on. Perhaps being more explicit about them from the get-go will be less confusing. Will give that a try anyway.
In a74fe72 I've added a subsection explicitly discussing backtrack/lookahead with ZWJ/ZWNJ and attempting to frame that concern in a more grounded context. Eyes welcome.
It also (hopefully) bumps up the prominence of the section by giving it a header, with the goal of making it harder to miss and thus getting new readers thinking about the issues earlier on in hypothetical future implementation projects.
Doing that now because I think this is at long last the end of untangling the various overlapping ZWxJ problems, and I can see that structurally we'd want to treat Dotted Circle handling in a similar fashion, which is the next knot. In any event, comments on that commit are appreciated, but they're not urgent.
I believe this is closable now, via #121; feel free to reopen if necessary.
HarfBuzz ignores any ZWJs encountered during backtrack/lookahead, but not in input (unless a flag is enabled). There is an interesting discussion for/against this behaviour within the context of Indic shaping, but the issue was closed without resolution.
Consider one of the simpler examples from our test corpus (Wikipedia dump, so this syllable is out there "in the wild"):
<क U+0915, U+200D, ी U+0940>
There exists a chaining contextual lookup in Noto Serif Bengali where the glyph for < ी U+0940> is the input, and the glyph for <क U+0915> is the backtrack. On a successful match, the glyph for < ी U+0940> undergoes a single substitution, yielding:
HarfBuzz skips the backtrack ZWJ, so its output is identical to the above, whereas Uniscribe and Allsorts allow the ZWJ to inhibit the substitution of the < ी U+0940> glyph, yielding:
For this example, I would assume that the former would be a more desirable output. However, especially in the context of Indic shaping, is it "correct" for a shaping engine to selectively ignore the existence of ZWJs?
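For anyone who wants to poke at the two behaviours, here is a toy model of the lookup in Python. It is only a sketch: the glyph names (`ka`, `zwj`, `ii`, `ii.alt`) and the function `apply_chain_subst` are made up for illustration, a real engine works on glyph IDs and lookup flags rather than strings, and this is not how HarfBuzz, Uniscribe, or Allsorts actually implement it.

```python
def apply_chain_subst(glyphs, backtrack, input_glyph, replacement,
                      skip_zwj_in_backtrack):
    """Toy chaining contextual lookup: one backtrack glyph, one input glyph.

    If skip_zwj_in_backtrack is True, ZWJ glyphs sitting between the
    input and the backtrack are stepped over when matching; otherwise
    a ZWJ there blocks the match.
    """
    out = list(glyphs)
    for i, g in enumerate(glyphs):
        if g != input_glyph:
            continue
        j = i - 1
        if skip_zwj_in_backtrack:
            # Step over any ZWJs immediately before the input glyph.
            while j >= 0 and glyphs[j] == "zwj":
                j -= 1
        if j >= 0 and glyphs[j] == backtrack:
            out[i] = replacement  # single substitution on the input glyph
    return out

# <U+0915, U+200D, U+0940> with hypothetical glyph names.
seq = ["ka", "zwj", "ii"]

skipping = apply_chain_subst(seq, "ka", "ii", "ii.alt", True)
blocking = apply_chain_subst(seq, "ka", "ii", "ii.alt", False)
```

With skipping on, the input glyph is substituted as if the ZWJ were not there; with skipping off, the ZWJ sits in the backtrack position and the match fails, so the input glyph is left alone. Those are the two outcomes being compared in this issue.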