Open frivoal opened 5 years ago
It feels to me that the intention / rationale behind this was to be a character that adds a word- (and hence possible line-) break without changing any other behavior.
Does it make sense to add a word-break without affecting shaping? That sounds counter intuitive to me.
@fantasai I've seen you make comments in various places about ZWSP that implied that you thought it ought to break shaping (here's just one), so I'd be interested in your feedback on this.
https://unicode.org/reports/tr44/#Release_Stability (Date 2019-02-27)
says §2.3.1 "Updates to character properties [...] may be required [...] to change the assigned values for a property". [...] "For example, U+200B ZERO WIDTH SPACE was originally classified as a space character (General_Category=Zs), but it was reclassified as a Format character (General_Category=Cf) to clearly distinguish it from space characters in its function as a format control for line breaking"
1) It follows that, as a Format character, ZWS can also serve "to indicate word boundaries" as raised by @r12a.
2) Is ZWS still a "risky" character for a stable implementation?
@frivoal, I didn't find ZWS as "Transparent" in the link "https://www.unicode.org/Public/UCD/latest/ucd/ArabicShaping.txt". Am I missing something?
Read the file header re missing values.
I didn't find ZWS as "Transparent" in the link "https://www.unicode.org/Public/UCD/latest/ucd/ArabicShaping.txt". Am I missing something?
It's not listed explicitly, but it is covered by the generic rule:
Note: Code points that are not explicitly listed in this file are either of joining type T or U:
- Those that are not explicitly listed and that are of General Category Mn, Me, or Cf have joining type T.
ZWS is in category Cf.
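The default rule quoted above can be sketched in a few lines of Python. This is only an illustration: `EXPLICIT_JOINING_TYPES` is a hypothetical, hand-written excerpt standing in for a real parse of ArabicShaping.txt, and `joining_type` is a name made up for this example.

```python
import unicodedata

# Hypothetical hand-written excerpt; a real implementation would parse
# ArabicShaping.txt instead of hard-coding a few entries.
EXPLICIT_JOINING_TYPES = {
    0x0627: "R",  # ARABIC LETTER ALEF
    0x0628: "D",  # ARABIC LETTER BEH
    0x200C: "U",  # ZERO WIDTH NON-JOINER
    0x200D: "C",  # ZERO WIDTH JOINER
}

def joining_type(ch: str) -> str:
    """Return the joining type, applying the file's default rule for unlisted code points."""
    cp = ord(ch)
    if cp in EXPLICIT_JOINING_TYPES:
        return EXPLICIT_JOINING_TYPES[cp]
    # Not listed: General Category Mn, Me, or Cf means T, anything else means U.
    if unicodedata.category(ch) in {"Mn", "Me", "Cf"}:
        return "T"
    return "U"

for ch in ("\u200b", "\u00ad", " "):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), joining_type(ch))
```

Because ZWSP has General Category Cf and no explicit entry, it falls through to the default and comes out as T, while U+0020 (category Zs) comes out as U.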
cc @roozbehp again, since I think he can articulate the reasoning behind this best.
Does it make sense to add a word-break without affecting shaping? That sounds counter intuitive to me.
My understanding is that ZWSP is inherently a line-break control character. So it's misleadingly named at best. It's closer to SOFT HYPHEN, than to a space. The difference from SOFT HYPHEN seems to be that this one is not expected to turn into a hyphen if line break does happen. I think the original use case was to be used with scripts that don't use inter-word spaces, to mark line break opportunities. It feels to me that ZWSP and SOFT HYPHEN should have been one character, to mark "line break allowed".
Another use of ZWSP is to mark break opportunities / word boundaries in concatenated words like long URLs or hash tags like "justanotherawesomelylongurl". The idea is that whether or not you mark a location as a line-break opportunity should be separate from whether Arabic shaping happens. So, if Arabic shaping is not desired, one should use a ZWNJ to control that, separately from ZWSP.
Another explanation is that many characters, like ZWSP, affect only one aspect of Unicode processing and are ignored by all other processes. This is done to manage complexity: instead of having to specify the behavior of each control / format character for every process in every script, the behavior can be specified independently of scripts for the most part.
Anyway. Just my understanding. As I said, I was also surprised by this, but I understand what the rationale / thinking has been.
I'm also not convinced this ZWSP should be used with Nastaliq...
I could be convinced that you're right there. It certainly doesn't seem a good idea given what TUS expects wrt joining behaviour.
I note, however, that Firefox and Edge both break the cursive joining, although Chrome and Safari don't. See https://r12a.github.io/pickers/urdu/?text=%DB%81%D8%B1%E2%80%8B%D8%B4%D8%AE%D8%B5%E2%80%8B%DA%A9%D9%88%E2%80%8B%D8%A7%D8%B3%E2%80%8B%D8%A8%D8%A7%D8%AA%E2%80%8B%DA%A9%D8%A7%E2%80%8B%D8%AD%D9%82 for an example.
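For readers who want to see where the ZWSPs sit in that test string, here is a small sketch that decodes the `text` parameter of the URL above and splits it at U+200B. The variable names are only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

ZWSP = "\u200b"

url = (
    "https://r12a.github.io/pickers/urdu/?text="
    "%DB%81%D8%B1%E2%80%8B%D8%B4%D8%AE%D8%B5%E2%80%8B%DA%A9%D9%88%E2%80%8B"
    "%D8%A7%D8%B3%E2%80%8B%D8%A8%D8%A7%D8%AA%E2%80%8B%DA%A9%D8%A7%E2%80%8B"
    "%D8%AD%D9%82"
)

# parse_qs percent-decodes the query value as UTF-8 by default.
text = parse_qs(urlsplit(url).query)["text"][0]

# Split at each ZWSP to see the segments it separates.
words = text.split(ZWSP)

print("ZWSP count:", text.count(ZWSP))
for word in words:
    print(" ".join(f"U+{ord(c):04X}" for c in word))
```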
Note further, however, that a difference between ZWSP and soft hyphen is that ZWSP is expected to expand when letter spacing or justification is applied to a line. This makes it not quite a simple line-break opportunity indicator. (TUS describes it as "indicates a word break or line break opportunity" and goes on to describe how justification algorithms are likely to add space on p872 of TUS version 12.) I'm wondering now how that fits with justification of cursive scripts.... Would it be expected to create kashida-like behaviour when justified, but shrink back to nothing otherwise (unlike tatweel)?
I note, however, that Firefox and Edge both break the cursive joining, although Chrome and Safari don't.
I'm fairly sure Firefox breaks it just because the font doesn't have ZWSP. If you use an Arabic font that does have ZWSP I expect Firefox to NOT break shaping. Quite possibly the same about Safari.
You're right re justification. At this point I can't make up my mind about the exact intended use of ZWSP anymore.
The text on justification and ZWSP in TUS is on p872, in Volume 23. I see no difference in behaviour between ZWSP and soft hyphen for justification; if the line-break opportunity is not taken, then both characters are simply ignored; justification behaves as though they were not present. The issue seems to be mentioned for ZWSP because someone once thought that ZWSP suppressed the expansion of inter-character spacing.
A rendering difference between ZWSP and soft hyphen is that when the line break opportunity is taken, a soft hyphen has hyphenation effects, such as the appearance of a hyphen and changes in spelling. (Some varieties of Thai use hyphenation with hyphens.) Another difference is that ZWSP marks a word boundary whereas a soft hyphen has no such effect; this matters for spell checkers.
I'm also not convinced this ZWSP should be used with Nastaliq...
Agreed. Nastaliq text (just like the same text if it were written in a "simple" Arabic font) would use U+0020 for inter-word spaces. A Nastaliq font might give its space glyph a very narrow (or even zero) default advance width, but it's still U+0020.
So does anyone know of any circumstances in which it would make sense to use ZWSP inside a word or sequence of characters in a script that is cursive?
Would control of justification behaviour be such a circumstance? (Can't see it really, but if so, we need to ask further questions about how that would work.)
For that matter, does ZWSP have a useful application in any script outside those that don't separate words with spaces (such as Khmer, Japanese, etc.)?
Would control of justification behaviour be such a circumstance? (Can't see it really, but if so, we need to ask further questions about how that would work.)
As someone pointed out, ZWSP doesn't have any justification behavior either. It's as if it wasn't there. The clarification in Unicode saying that it might stretch simply means that it might stretch the same way as it might without ZWSP.
For that matter, does ZWSP have a useful application in any script outside those that don't separate words with spaces (such as Khmer, Japanese, etc.)?
I suppose not, because cursive scripts use hyphenation and as such should use soft-hyphen instead.
I for one regularly use ZWSP to provide line-breaking opportunities for long ASCII file paths, as breaks at soft hyphens could be mistaken for breaks at hyphens.
Right. That's another legitimate use case for them.
Watch out though, because https://r12a.github.io/scripts/punctuation/block.html#char200B works, whereas https://r12a.github.io/scripts/punctuation/block.html#char200B fails. And it's not at all clear why, to the user who may have copy/pasted the URL into github, email (or many other applications). Sounds like a dangerous thing to recommend generally.
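A minimal sketch of both sides of this: adding ZWSP after path separators for display, and stripping it again before the string is used as a real path or URL. The function names `add_break_opportunities` and `strip_zwsp` are hypothetical, and the choice of separators is an assumption.

```python
ZWSP = "\u200b"

def add_break_opportunities(path: str, separators: str = "/\\") -> str:
    """Insert a ZWSP after each separator so long paths can wrap when displayed."""
    out = []
    for ch in path:
        out.append(ch)
        if ch in separators:
            out.append(ZWSP)
    return "".join(out)

def strip_zwsp(text: str) -> str:
    """Remove every ZWSP, e.g. before treating copied text as a path or URL."""
    return text.replace(ZWSP, "")

display = add_break_opportunities("/usr/local/share/doc")
print([f"U+{ord(c):04X}" for c in display if c == ZWSP])
print(strip_zwsp(display))
```

The stripping step is the part that copy/paste tends to skip, which is why the invisible character can make an otherwise identical URL fail.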
Ok, so the conclusion seems to be that this might be surprising if you don't think about it too much, but that people have thought this through, and that ZWS is more like a soft hyphen, and that not breaking shaping is very much intentional, and documented in Unicode.
There might be problems in browsers about failing to do the correct shaping if Zero Width Space is missing from the Arabic font, but that's an implementation matter, not a question about whether ZWS is supposed to break shaping.
So, I was feeling ready to close this with no action, but…
It seems that neither MS Word (regardless of font), nor InDesign (regardless of font) nor Firefox (regardless of font) or EdgeHTML (regardless of font) respect that, and they do break shaping on Zero Width Space always. Additionally, whether Chrome breaks shaping or not depends on the font: it shapes with most good Unicode/Arabic fonts, but not with (some?) default system fonts.
Fonts tested: Arial Unicode MS, Segoe UI, Noto Naskh Arabic, Code2000, Adobe Arabic, Calibri.
So, is this a situation where all/most implementations need to get their acts together and get fixed to match the standard, or is the standard fiction that needs to get fixed?
@r12a re: https://github.com/w3c/csswg-drafts/issues/3861#issuecomment-488767899
So does anyone know of any circumstances in which it would make sense to use ZWSP inside a word or sequence of characters in a script that is cursive?
If the sequence does not otherwise contain any spacing or spaces characters, then no.
Would control of justification behaviour be such a circumstance? (Can't see it really, but if so, we need to ask further questions about how that would work.)
Justification should entirely ignore ZWSP, in particular, it should not assign a non-zero width to it.
For that matter, does ZWSP have a useful application in any script outside those that don't separate words with spaces (such as Khmer, Japanese, etc.)?
Yes, as has been mentioned above.
Posted as an issue to the UTC. I guess we wait for their response.
I can't link to it because they (still!) don't use a public bug tracker. :/
I'm the original commenter, so I guess "Commenter Response Pending" is for me. I agree this is out of scope for the css-text-3 spec, so I'm OK with closing it.
But even if it isn't in scope for the spec, it is relevant for linebreaking, affects tests and implementations, and it would be good to put this question to rest. UTC not having a public tracker is unfortunate. Is there a way (not necessarily a URL or an API, a human contact is OK as well) to track the status?
@fantasai what did you say to the UTC? (and where did you post it?)
fwiw, I created some interactive tests. See https://w3c.github.io/i18n-tests/results/int-cursive for a summary, and click on #26 to see the detailed results, listed by font, at https://github.com/w3c/character_phrase_tests/issues/26
@r12a Basically posed the question @frivoal asked in https://github.com/w3c/csswg-drafts/issues/3861#issuecomment-529348086 - should implementations be changed to match the Unicode spec, or should Unicode be adjusted to match implementations?
Leaving it open; waiting on UTC reply. The comment was posted through their official feedback channel which, as I mentioned, does not have any public-facing tracking.
FYI - it is tracked in F5 of https://www.unicode.org/L2/L2020/20108-properties-feedback.pdf and will probably be discussed in UTC #163 this week.
Admittedly, I don't know all that much about Unicode, but I'm a little puzzled about the (planned) response documented in the file @xfq linked to. The bug report by @fantasai roughly says "the spec doesn't match implementations, maybe the spec should change to match implementations", and a part of the response boils down to "no, that would break implementations". The other part of the response, that setting ZWSP's General_Category to Cf rather than Zs was very deliberate and justified, seems perfectly reasonable, and might even be enough to override compatibility concerns (though I'm not the right person to make this call). But citing compatibility concerns in support of keeping the specification unchanged when faced with a claim that implementations do not follow the specification is perplexing.
It seems to me that a little bit more testing would be appropriate. I did not do extensive testing, but the implementations I did find to be in violation of the spec are pretty major ones. InDesign, MS Word, LibreOffice, Firefox, EdgeHTML, Chrome (depending on the font), Apple Mail & Apple TextEdit (so presumably the built-in text component of macOS)…
The (proposed) response suggests that they "Respond to Elika, informing them that the UTC declines to change the General_Category of U+200B Zero Width Space". But AIUI, Elika's report did not specifically ask for the General_Category to be changed; it only queried the Arabic shaping behavior.
I believe our concern here would be adequately addressed by just adding an entry for ZWSP to ArabicShaping.txt, assigning it joining type U (rather than the default T for characters of General Category Cf). At the point (16 years ago) when ZWSP was changed from GC=Zs to GC=Cf, it doesn't look like Arabic joining behavior was considered.
The primary use for ZWSP, I think, is to control the provision of potentialLineBreakPositions within long strings of otherwise-unbreakable text (e.g. in paths, or in scriptio continua writing systems), where it means, more or less, "word boundary with no visible space". As such, I think it is correct for it to interrupt cursive joining: if I write Arabic words without spaces between them, I'd still expect to interrupt joining at the word boundaries (it's somewhat analogous to the use of camelCase when writing a multiWordEnglishLanguageIdentifier).
So I think the proposed response is answering the wrong question. We're not requesting a change of General Category but a change of Arabic Joining Type.
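To make the difference between the two joining types concrete, here is a toy sketch of how a shaper might decide which neighbouring letters connect, with ZWSP treated either as T (current data) or as U (the proposed change). It is a deliberate simplification of the joining rules, `LETTER_TYPES` is a hypothetical hand-written excerpt, and `joined_pairs` is a name invented for this example.

```python
ZWSP = "\u200b"
ZWNJ = "\u200c"
BEH = "\u0628"   # dual joining (D)
ALEF = "\u0627"  # right joining (R): connects only to the preceding letter

# Hypothetical excerpt of joining types, just enough for this demo.
LETTER_TYPES = {BEH: "D", ALEF: "R", ZWNJ: "U"}

def joined_pairs(text: str, zwsp_type: str) -> list[tuple[int, int]]:
    """Return index pairs (in logical order) of characters that connect."""
    def jt(ch: str) -> str:
        return zwsp_type if ch == ZWSP else LETTER_TYPES.get(ch, "U")

    pairs = []
    prev = None  # index of the last non-transparent character seen
    for i, ch in enumerate(text):
        t = jt(ch)
        if t == "T":
            continue  # transparent: skipped when looking for neighbours
        if prev is not None:
            # The earlier character must join forward, the later one backward.
            if jt(text[prev]) in {"D", "L", "C"} and t in {"D", "R", "C"}:
                pairs.append((prev, i))
        prev = i
    return pairs

sample = BEH + ZWSP + BEH
print("ZWSP as T:", joined_pairs(sample, "T"))
print("ZWSP as U:", joined_pairs(sample, "U"))
```

With T the two letters connect as if the ZWSP were absent; with U the ZWSP itself becomes a non-joining neighbour and the connection is interrupted.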
+1 to @jfkthame.
Here's a link to a revised recommended UTC action: https://www.unicode.org/L2/L2020/20108r-properties-feedback.pdf (not discussed in UTC yet)
FTR, I sent a note to the UTC with essentially the same content as my comment here yesterday, which has led to the revised feedback doc linked above. If (as I expect) the UTC accepts that recommendation, the next step will be to follow up with additional documentation (assuming we want to pursue this).
Link to the draft minutes of UTC: https://www.unicode.org/L2/L2020/20102.htm#163-A74
Unicode's response is that they decline to change. I find the rationale a bit light, but nevertheless, we seem to have reached the end of this. Time to close?
Actually, the UTC response reads to me as "No change yet. Write a formal request, giving a full justification and an assessment of the impact of the change."
On the camelCase analogy, ZWNJ will break the cursive connections quite nicely. If one wants line break opportunities between the elements, then an additional ZWSP offers the scope, as an additional effect.
I had a look at some well-written handwritten Tai Tham. There's scope for the subscript consonants and vowels of adjacent clusters to clash. I noticed that clashes below were avoided, and that the presence of zero width word boundaries didn't seem to affect the avoidance strategies. So shaping in general carries on across ZWSP. I note further that U+00AD SOFT HYPHEN is also transparent, and have been told that the Arabic joining rules are continued across line breaks. That argues that joining should continue across ZWSP. Of course, it doesn't say whether complex ligatures (e.g. vertical stacking) should be allowed across ZWSP.
The Unicode standard offers the following instructions:
1) Break joining, no line break: ZWNJ
2) Break joining, allow line break with hyphen: ZWNJ, 00AD or 00AD, ZWNJ
3) Break joining, allow line break without hyphen: ZWNJ, ZWSP or ZWSP, ZWNJ
4) Continue joining, no line break: (default)
5) Continue joining, allow line break with hyphen: 00AD (TBC)
6) Continue joining, allow line break without hyphen: ZWSP
The proposed change would remove the ability to command no. 6.
The proposed change would disallow option 5.
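The six combinations listed above can be written out as character sequences. This is only a sketch: the `OPTIONS` table and the `between` helper are hypothetical names, and where the list allows either order of two characters, one order is picked arbitrarily.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
SHY = "\u00ad"   # SOFT HYPHEN
ZWSP = "\u200b"  # ZERO WIDTH SPACE

# Option number -> characters to place between two letters.
OPTIONS = {
    1: ZWNJ,         # break joining, no line break
    2: ZWNJ + SHY,   # break joining, line break with hyphen (either order)
    3: ZWNJ + ZWSP,  # break joining, line break without hyphen (either order)
    4: "",           # continue joining, no line break (default)
    5: SHY,          # continue joining, line break with hyphen (TBC in the list)
    6: ZWSP,         # continue joining, line break without hyphen
}

def between(left: str, right: str, option: int) -> str:
    """Join two strings with the control characters for the given option."""
    return left + OPTIONS[option] + right

for number in sorted(OPTIONS):
    seq = between("\u0628", "\u0628", number)
    print(number, " ".join(f"U+{ord(c):04X}" for c in seq))
```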
shaping in general carries on across ZWSP
I don't think this is particularly relevant. Shaping in general carries on across everything (including non-zero-width spaces as well). For a nice example, see the Awami Nastaliq font (check the examples of "Diagonal Cluster Fitting" at http://software.sil.org/awami/what-is-special/).
The question here is only about whether ZWSP should be class T or U for the purposes of the Arabic Shaping property. This is unrelated to whether shaping (as a general mechanism) may take effect across it.
Point that was brought up in a discussion with @frivoal: Arabic letters should not change shape depending on whether there was a line break or not. That's the correct behavior for line-break: anywhere and the correct behavior in the presence of hyphenation. So breaks caused by ZWSP should have consistent shaping behavior whether the line breaks at that point or not.
This is probably more of a Unicode issue than a CSS issue, but we have a fair number of people involved with text layout and i18n over here, so filing it here first to figure out if we should take it to Unicode or not.
When writing https://github.com/web-platform-tests/wpt/pull/14673, I had misread the Unicode standard, and thought that ZERO WIDTH SPACE was supposed to break Arabic shaping, based on a table that said "all spacing characters" do so. But there's a distinction between "spacing characters" and "spaces characters", and ZERO WIDTH SPACE is part of the latter, not the former.
https://www.unicode.org/Public/UCD/latest/ucd/ArabicShaping.txt gives further details about which character does what to shaping, and classifies ZERO WIDTH SPACE as T (transparent), which neither forces nor breaks shaping, and just behaves as if it wasn't there for shaping purposes.
So Unicode has a definite answer as to what's supposed to happen, but several people in the thread about my tests were surprised by that answer (including @behdad, @r12a, and myself), because ZERO WIDTH SPACE is used as a word divider, and that suggests it ought to be breaking shaping. @r12a brought up nastaliq as a reasonable use case, because:
So, what do we collectively think? Is Unicode likely enough to be mistaken that we should raise this issue with them? Is there a known good reason for why things are the way they are?