w3c / csswg-drafts

CSS Working Group Editor Drafts
https://drafts.csswg.org/

[css-text] Render U+2028 LINE SEPARATOR as a forced line break #6992

Closed tabatkins closed 2 years ago

tabatkins commented 2 years ago

Originally posted by Ka-Ping Yee

I'd like to propose that U+2028 be rendered as a forced line break.

The changes to the CSS Text Module Level 3 draft would be minimal; for example:

The rationale is straightforward:

For reference, the Unicode Standard 14.0 defines U+2028 LINE SEPARATOR as an "unambiguous separator character". By my reading, it could hardly be more clear as to what U+2028 is intended to represent, and what the most sensible rendering should be:

5.8 Newline Guidelines

[...]

Line Separator and Paragraph Separator

A paragraph separator—independent of how it is encoded—is used to indicate a separation between paragraphs. A line separator indicates where a line break alone should occur, typically within a paragraph. [...] For comparison, line separators basically correspond to HTML <BR>, and paragraph separators to older usage of HTML <P> (modern HTML delimits paragraphs by enclosing them in <P>...</P>).

[...]

Recommendations

The Unicode Standard defines two unambiguous separator characters: U+2029 paragraph separator (PS) and U+2028 line separator (LS). In Unicode text, the PS and LS characters should be used wherever the desired function is unambiguous.
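As a quick cross-check of the excerpt above, the character names and general categories can be looked up with Python's standard `unicodedata` module. This is only a sketch: `describe_separator` is a helper name made up for illustration, and `unicodedata` reports the General_Category (Zl, Zp), not the UAX #14 line breaking class.

```python
import unicodedata


def describe_separator(ch: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


# The two "unambiguous separator characters" named in the Unicode Standard.
for ch in ("\u2028", "\u2029"):
    print(*describe_separator(ch))
# prints:
# U+2028 LINE SEPARATOR Zl
# U+2029 PARAGRAPH SEPARATOR Zp
```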

I'd appreciate hearing your thoughts and suggested next steps on this.

Thanks very much!

kennyluck commented 2 years ago
zestyping commented 2 years ago

Thanks, @tabatkins!

I can't edit the issue description directly, but here it is with the markup fixed up to render correctly on GitHub: [Copied into OP]

xfq commented 2 years ago

I tested the rendering of this character in various browsers and editors, for your reference.


In Chromium it is rendered as a box with a cross (font is Hiragino Kaku Gothic ProN).

In Firefox, Safari, and iCab, it doesn't display at all.


In Visual Studio Code, the editor will emit a warning when it detects this character. See https://github.com/microsoft/vscode/issues/96142


In Atom, it is not rendered. See https://github.com/atom/atom/issues/12157


In Sublime Text 4, it is rendered as <0x2028>.

In TextEdit it is rendered as a forced line break.


In GNU Emacs (27.2) it is rendered as horizontal whitespace instead of a line break, even after enabling whitespace-mode.

In Vim (8.2) it is the same.


For the applications I tested, only TextEdit renders this character as a newline.

See also:

zestyping commented 2 years ago

Thank you for doing this research, @xfq!

fantasai commented 2 years ago

I think this issue is filed on the basis of some misunderstandings.

CSS3 Text has, technically, required LS to be treated as a forced break for at least a decade. If browsers are not treating it as such, that should be considered a bug against them. Closing as invalid (not a spec issue).

@zestyping Copied your fixed markup into the OP! Thanks for caring about this issue, I hope your concern can motivate the browsers to fix this longstanding problem.

xfq commented 2 years ago

Browser bug reports: Gecko, Blink, WebKit

Since this code point isn't directly mentioned in css-text, I'm not quite sure if we need to add a relevant test in WPT.

fantasai commented 2 years ago

@xfq Tests for any behavior specced in css-text-3, even if indirectly, are welcome in WPT. :) Probably best to do it as a test for all BK/NL characters.

zestyping commented 2 years ago

@fantasai Thank you for clarifying this! I do see now that Section 4.1 did not mean to refer to U+2028 when defining "other space separators".

CSS3 Text has, technically, required LS to be treated as a forced break for at least a decade. If browsers are not treating it as such, that should be considered a bug against them.

Can this be taken as an official statement on the WG's intended interpretation of LS? I would be delighted to know that treating U+2028 as a forced line break is already the behaviour that CSS Text 3 intends to specify!

I can imagine browser developers not finding this to be obvious from the spec. If this interpretation is not clear to them, would it be appropriate for me to point them at this comment thread as an authoritative ruling?

Here is why I suspect they might find it rather subtle. CSS Text 3 mentions many other relevant characters by code point (such as U+000A, U+0020, etc.) and name (CARRIAGE RETURN, IDEOGRAPHIC SPACE, etc.). Yet U+2028 is never mentioned anywhere in the entire spec. Neither LINE SEPARATOR nor its abbreviation LSEP is mentioned anywhere. Neither the "Line Separator" category nor its abbreviation "Zl" is mentioned anywhere. An ordinary reader can wonder why U+2028 doesn't render as a line break, search for the spec, arrive at CSS Text 3, search the entire document for every imaginable term related to U+2028, and find nothing. Indeed, that was my experience, and it is what led me to file this issue. And, of course, we have the empirical evidence of a decade of browser development oblivious to this rule.

Would the CSS editors be willing to consider making this a little more explicit? I can think of one small change that would clear this all up.

As you pointed out, Section 5.1, bullet point 2 says "lines always break at each preserved forced break character".

Regardless of the 'white-space' value, lines always break at each preserved forced break character: thus for all values, line-breaking behavior defined for the BK and NL Unicode line breaking classes must be honored. [UAX14]

But there is no definition for the term "forced break character" in the spec. If you assume that a "forced break character" has something to do with a "forced line break", then the term "preserved forced break character" is nonsensical: "forced line break" is defined in terms of preserved characters, so there can be no such thing as a non-preserved forced break character. If you instead start by trying to understand the term "preserved", you find that it is defined only as part of the term "preserved white space", wherein the default meaning of "white space" is "document white space characters", which consists of U+0020, U+0009, and segment breaks; so "preserved" has no meaning when applied to other characters like U+2028.

Fixing this is easy; delete the confusing term and simplify the bullet point to:

Regardless of the white-space value, Unicode characters with the mandatory break property (BK) must be treated as forced line breaks. This includes U+000C, U+2028, and U+2029 [UAX14].

(I am omitting VT and NEL here because UAX#14 says "implementations are not required to support the VT character" and "implementations are not required to support the NEL character".)

zestyping commented 2 years ago

@xfq Thank you for filing https://bugs.webkit.org/show_bug.cgi?id=235753 !

frivoal commented 2 years ago

Can this be taken as an official statement on the WG's intended interpretation of LS? I would be delighted to know that treating U+2028 as a forced line break is already the behaviour that CSS Text 3 intends to specify!

I'd agree with that interpretation. css-text-3 states that:

[…] for the BK and NL Unicode line breaking classes must be honored. [UAX14]

UAX14 states that U+2028 has the non-tailorable BK class, and that “The text after [it] starts at the beginning of the line”.

There's a level of indirection, which may make it non obvious on a casual read, but I think it's unambiguous that this is the expected behavior.

CSS Text 3 mentions many other relevant characters by code point (such as U+000A, U+0020, etc.) and name (CARRIAGE RETURN, IDEOGRAPHIC SPACE, etc.). Yet U+2028 is never mentioned anywhere in the entire spec

css-text-3 mentions those characters where special css-specific processing going beyond (or against) Unicode is needed. For the rest, as stated in 1.5, “CSS is built on Unicode. UAs […] must adhere to all normative requirements of the Unicode Core Standard, except where explicitly overridden by CSS.” So css-text-3 cannot be implemented correctly without referencing Unicode (and in particular UAX14), which in the case of U+2028, gives us a definitive normative answer.

That said, if an editorial change can make this clearer, I'd be happy to take that on.

Fixing this is easy; delete the confusing term and simplify the bullet point to:

Regardless of the white-space value, Unicode characters with the mandatory break property (BK) must be treated as forced line breaks. This includes U+000C, U+2028, and U+2029. [UAX14]

I don't think this quite works. That covers the BK class, but leaves off preserved segments breaks (U+000A).

Also

I am omitting VT and NEL here because UAX#14 says "implementations are not required to support…

I am interpreting css-text-3 to be going beyond Unicode here, removing the optionality, and adding a requirement that this be supported for the sake of interoperability, so I'd rather keep it.

How about

Preserved segment breaks, and—regardless of the white-space value—any Unicode character with the BK or NL line breaking class, must be treated as forced line breaks. [UAX14]

Note: As of Unicode 14, the BK and NL classes include U+000B, U+000C, U+0085, U+2028, and U+2029.
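To make the proposed wording concrete, here is a minimal Python sketch of the rule, not how any browser implements it. It assumes the text is already in a preserving mode (such as `white-space: pre`), so that each U+000A counts as a preserved segment break, and it hard-codes the BK/NL code points from the note because Python's standard library does not expose the UAX #14 Line_Break property. `split_at_forced_breaks` is a hypothetical name used only for illustration.

```python
import re

# BK and NL class code points as listed in the note (Unicode 14).
# Hard-coded: the standard library has no Line_Break property lookup.
BK_NL_CHARS = "\u000B\u000C\u0085\u2028\u2029"

# Assumption: segment breaks are preserved, so U+000A also forces a break.
# Carriage returns are deliberately left alone in this sketch.
FORCED_BREAK = re.compile("[\n" + BK_NL_CHARS + "]")


def split_at_forced_breaks(text: str) -> list[str]:
    """Split text into lines at every forced break character."""
    return FORCED_BREAK.split(text)


print(split_at_forced_breaks("one\u2028two\nthree\u0085four"))
# prints: ['one', 'two', 'three', 'four']
```

A real implementation would consult the Line_Break property data from the Unicode Character Database rather than a fixed list, since the set of BK and NL characters is defined by Unicode and not by CSS.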

zestyping commented 2 years ago

@frivoal That looks great! I agree with your reasoning. Thank you for the careful review and clarification.

frivoal commented 2 years ago

@fantasai does the proposal at the bottom of https://github.com/w3c/csswg-drafts/issues/6992#issuecomment-1025329457 look reasonable to you, or do you think I missed something?

zestyping commented 2 years ago

@fantasai I see that the first sentence of @frivoal's suggestion made it into https://www.w3.org/TR/css-text-4/:

Regardless of the white-space value, lines always break at each preserved forced break character: thus for all values, line-breaking behavior defined for the BK and NL Unicode line breaking classes must be honored. [UAX14]

but not the second sentence:

Note: As of Unicode 14, the BK and NL classes include U+000B, U+000C, U+0085, U+2028, and U+2029.

Any particular reason why this should not be included? I realize these code points are implied by reference to UAX14, but it seems nice to be explicit, especially given that plenty of other code points are mentioned by number in this draft.

fantasai commented 2 years ago

@zestyping As noted in https://github.com/w3c/csswg-drafts/issues/6992#issuecomment-1023682885, that sentence was always there: https://www.w3.org/TR/css-text-3/#line-break-details

fantasai commented 2 years ago

Updated the specs to use Florian's rephrasing. As for a note listing all the individual codepoints... I think it's better to just make sure there's testcases in WPT.