w3c / csswg-drafts

CSS Working Group Editor Drafts
https://drafts.csswg.org/

[css-text-4] Remove collapsible line breaks adjacent to word separators #3481

Open fantasai opened 5 years ago

fantasai commented 5 years ago

We have rules in place that eliminate line breaks if they are adjacent to ZWSP, leaving behind the ZWSP when assembling the paragraph text from multiple lines of source text. However, we didn't consider explicit word separators such as the Ethiopic word space. Probably all “word separators” (other than space and nbsp) should have the same behavior as ZWSP here.
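
A minimal sketch of the behavior proposed above, not the spec's algorithm: source lines are joined with a single space, except that the line break is dropped when the character on either side of it is a word separator. The function name `join_source_lines` and the set `CANDIDATE_SEPARATORS` are hypothetical, the set is only an assumption about which characters might qualify, and the existing rules for Chinese and Japanese text are left out.

```python
# Hypothetical candidate set, for illustration only. ZWSP is the character
# already covered by the existing rule; the other two are assumptions.
CANDIDATE_SEPARATORS = frozenset({
    "\u200b",  # ZERO WIDTH SPACE
    "\u1361",  # ETHIOPIC WORDSPACE
    "\u1680",  # OGHAM SPACE MARK
})


def join_source_lines(source, separators=CANDIDATE_SEPARATORS):
    """Join source lines, dropping a line break next to a word separator."""
    # Strip collapsible spaces and tabs around each line break.
    lines = [ln.strip(" \t") for ln in source.split("\n")]
    # Consecutive line breaks collapse together, so ignore empty lines.
    lines = [ln for ln in lines if ln]
    if not lines:
        return ""
    out = lines[0]
    for nxt in lines[1:]:
        if out[-1] in separators or nxt[0] in separators:
            out += nxt        # break removed, the separator stays
        else:
            out += " " + nxt  # break becomes a single space
    return out
```

With these assumptions, `join_source_lines("hello\nworld")` gives `"hello world"`, while a line ending in U+1361 is joined to the next line with no extra space. Passing a larger set as `separators` shows how the result changes if another character, such as the Tibetan tsheg, were included.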

asmusf commented 5 years ago

Tibetan intersyllabic tsheg?

fantasai commented 5 years ago

@asmusf I was thinking about that, yes. Also the Ogham space mark.

r12a commented 5 years ago

Seems to me that there are a number of characters that are or were (in archaic script use) used in place of spaces. But I'm surprised that this is an issue. @fantasai could you point to the part of the spec that is in question?

fantasai commented 5 years ago

@r12a https://www.w3.org/TR/css-text-3/#line-break-transform Where we handle ZWSP, it might make sense to handle other word separators that aren't spaces.

r12a commented 5 years ago

I suspect that what distinguishes ZWSP and TSEK in these circumstances is that [thai etc character][zwsp][whitespace] is likely to be an error, whereas [tibetan character][tsek][white space] is not (even when spaces in tibetan would theoretically use NBSP), or if it is an error this can only be detected by understanding the text and/or the intention of the author. Same goes for ethiopic word space.

I suspect that, mostly, content authors just need to be careful about how they compose the source text, so that spans of text that shouldn't include spaces don't, even if they are using an editor or tool that wraps lines automatically. It seems to me that that's also the approach you'd need to take when composing text in archaic hangul styles, where they didn't use spaces between words.

(Btw, this probably has implications for some aspects of Semantic linefeeds if the language of the text used doesn't employ spaces as word separators.)

fantasai commented 5 years ago

I think the goal should be that other languages are not at a significant disadvantage in how they organize their source code, i.e. make semantic linefeeds possible for all languages where we can plausibly do so without breaking existing content.

fantasai commented 5 years ago

I'm not sure what that means for what characters we should consider... I'm pretty sure that the Ogham space mark and Ethiopic word space should collapse with subsequent spaces; it doesn't make sense to want both. But for Tibetan, I'm not sure: does it really use spaces after tsek marks? (I know they do after shad, but that's a different character.) @r12a

css-meeting-bot commented 5 years ago

The CSS Working Group just discussed Collapsible breaks adjacent to word separators.

The full IRC log of that discussion
<fantasai> Topic: Collapsible breaks adjacent to word separators
<heycam> github: https://github.com/w3c/csswg-drafts/issues/3481
<heycam> fantasai: we generally have this concept in CSS and HTML that you can use white space to format your source, and we collapse white space down to a single space
<heycam> ... including line breaks
<heycam> ... for Chinese and Japanese which don't use spaces, we have some rules to remove the space otherwise you will be forced to put all paras on one line
<heycam> ... there are some rules for doing that based on character classes
<heycam> ... what we didn't consider thoroughly is languages that use a word separator that's not a space
<heycam> ... we do special case ZWSP, for Thai and other languages
<heycam> ... but we don't have something similar for Ethiopic word space
<heycam> ... probably don't also want a regular space there
<heycam> ... proposal is when there's a word separator character adjacent to a line break, the line break just goes away
<heycam> ... I think the characters that are affected here are Ogham space mark and Ethiopic word space and the Tibetan tsek
<heycam> AmeliaBR: does this map to something in Unicode? or do we need to maintain this list?
<koji> https://drafts.csswg.org/css-text-3/#word-separator
<heycam> r12a: I think there is something, not sure if it's fit for this purpose
<heycam> r12a: archaic scripts have other examples
<heycam> y
<heycam> fantasai: [reads definition in the spec right now for word-spacing]
<heycam> florian: we need to maintain a list
<heycam> myles: let's ask Unicode to do it
<heycam> ... if there is such a facility for these character lists, hard to believe it's specific for the web platform
<heycam> ... and not needed in text editors for example
<heycam> ... I don't think the web specs should maintain this list
<heycam> florian: I agree with part of your statement, should try to work this out with Unicode
<heycam> ... this one specifically maybe, but some are specifically web platform relatively
<heycam> ... since this is relevant to turning HTML markup into text
<heycam> myles: there are many different markup languages...
<heycam> fantasai: there are 2 questions
<heycam> ... if we want to do this, and then whether we maintain the list or if Unicode should
<heycam> addison: i think we want to do some research
<heycam> ... space or no space is a classic problem
<heycam> ... I would be surprised if there weren't something, but don't know off the top of my head
<heycam> ... would be happy to engage
<heycam> myles: if this is a classical problem, it's been solved, and we should figure out how it's been solved in the past and re-use that solution
<heycam> fantasai: looking at some of the stuff in css-text, we have a concept of word separators
<heycam> ... and it includes a set of code points
<heycam> ... it excludes Ogham space mark
<heycam> ... since it would cause text to not join any more
<heycam> ... so general usage in Unicode is text processing segmentation is not going to account for that concern, since they don't deal with typesetting
<heycam> ... so there's gonna be some aspects of how we're using Unicode codepoints with specific requirements that haven't come up in Unicode's context so far
<heycam> ... unbreaking lines is something that's been hard to explain to them
<heycam> myles: maybe we shouldn't be unbreaking them?
<heycam> fantasai: too late for that!
<heycam> addison: fwiw I've had to write this code in the past, and it's not any fun
<heycam> ... it may have been individually solved but not written down
<fantasai> fantasai: HTML has been unbreaking lines for as long as it has existed, we want to make that ability available to more languages
<heycam> r12a: like with the other issues, we need to look in more detail
<heycam> ... the Tsek is a syllable separator, not the same as a word joiner
<heycam> ... you could end a line with a Tsek, then start with more Tibetan on the next line, with indentation, and no real reason to join those together necessarily
<heycam> fantasai: you wouldn't make the Tsek go away, just avoid the extra space going in there
<heycam> ACTION: i18n to look this issue of word separators next to newlines
<trackbot> Error finding 'i18n'. You can review and register nicknames at <https://www.w3.org/Style/CSS/Tracker/users>.
<addison> action: addison: ensure we respond to css 3481
<trackbot> Error finding 'addison'. You can review and register nicknames at <https://www.w3.org/Style/CSS/Tracker/users>.
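
On the question raised in the log above about whether this list maps to something in Unicode, a small sketch using Python's standard `unicodedata` module shows the general category of each character named in this thread. The list `CANDIDATES` and the helper `describe` are hypothetical names chosen for illustration. The characters do not share a single general category, which is consistent with the remark in the discussion that a dedicated list would have to be maintained somewhere.

```python
import unicodedata

# Characters mentioned in this thread (the list itself is illustrative).
CANDIDATES = ["\u200b", "\u1361", "\u1680", "\u0f0b"]


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in CANDIDATES:
    print(*describe(ch))
```

The Ogham space mark is reported as a space separator (`Zs`), the Ethiopic wordspace and the Tibetan tsheg as other punctuation (`Po`), and ZWSP as a format character (`Cf`).
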

css-meeting-bot commented 4 years ago

The CSS Working Group just discussed Removing collapsible linebreaks, and agreed to the following:

The full IRC log of that discussion
<TabAtkins> Topic: Removing collapsible linebreaks
<astearns> github: https://github.com/w3c/csswg-drafts/issues/3481
<TabAtkins> fantasai: Proposal is to defer to level 4
<TabAtkins> astearns: Anyone concerned about punting?
<TabAtkins> astearns: reading thru the issue, lots of words I don't know...
<TabAtkins> astearns: We discussed previously and didn't get a conclusion
<TabAtkins> fantasai: Looks like it'll need more research and digging.
<TabAtkins> fantasai: I think we should get the spec done and defer this.
<TabAtkins> RESOLVED: Punt "removing collapsible linebreaks adjacent to word separators" to level 4