w3c / csswg-drafts

CSS Working Group Editor Drafts
https://drafts.csswg.org/

[css-text-4] Don't provide a language parameter for word-boundary-detection #7193

Closed r12a closed 1 year ago

r12a commented 2 years ago

2.2.1. Detecting Word Boundaries: the word-boundary-detection property

auto() This value directs the user agent to perform language-specific content analysis to determine where to insert virtual word boundaries.

<lang> must be a valid CSS <ident> or <string>. It represents an IETF BCP 47 language range (see [BCP47]). If the UA does not support word-boundary detection for all languages represented by the specified range, that specified value is invalid (and will cause the declaration to be ignored).

Fantasai provided some additional explanation about this feature, which explains why a range of language tags would be used:

I think the idea of this parameter was that it enables word-boundary-detection for the specified language(s) only; other languages are not affected. What language the element is actually in is what determines the language used for the word-boundary-detection. For example, if I have a trilingual document in English, Japanese, and French, if I set word-boundary-detection: auto(ja) then it will enable detection for Japanese paragraphs only. If there are no Japanese paragraphs, it won't have any effect.

The i18n WG thinks that the choice of whether or not to apply the word boundary detection algorithm should be set by applying the word-boundary-detection styling to the relevant content. The language information used should be that provided by the lang attribute, and not supplied as a parameter with this property value.

We don't think the content author is able to guess what languages are supported by the user agent, so it doesn't seem useful to make them specify language in the property value. We think that the approach currently described in the spec also requires the content author to have a level of understanding about language tagging that is too high (see the examples about Cantonese).

We think that there should also be a simple recommendation that user agents SHOULD NOT apply a boundary detection algorithm to text in a language for which the algorithm is not defined (modulo decisions wrt dialect support).

css-meeting-bot commented 2 years ago

The CSS Working Group just discussed don't provide a lang param for word boundary.

The full IRC log of that discussion:
<TabAtkins> topic: don't provide a lang param for word boundary
<TabAtkins> https://github.com/w3c/csswg-drafts/issues/7193
<TabAtkins> github: https://github.com/w3c/csswg-drafts/issues/7193
<fantasai> florian: In Text L4, we have a property to detect boundaries
<fantasai> florian: and another property to use boundaries
<fantasai> florian: e.g. in Japanese might want to word-break headings
<fantasai> florian: If not marked up in the source, can use this property to auto-detect the word breaks
<fantasai> florian: In CSS, ask for detection through this property
<fantasai> florian: and say which languages you want to auto-detect
<ChrisLilley> q+
<fantasai> florian: Implementers also want to not specify the language
<fantasai> florian: Problem is as an author, how do you know what's supported?
<emilio> q+
<fantasai> florian: e.g. if you say autodetect Thai, how do you fall back if the browser doesn't support autodetection for Thai?
<Rossen_> q
<fantasai> florian: In Japanese, normally you can break everywhere. If you turn that off, and only break at word boundaries, you will not wrap anywhere if you fail detection
<fantasai> florian: If you turn off normal line-breaking, and turn on one that requires word boundaries to be inserted, and you fail to insert them, the result will be very bad
<fremy> q+
<Rossen_> ack ChrisLilley
<fantasai> ChrisLilley: I think what I see from the i18n group is a wish to combine this with actual use of lang attributes in markup, and if you don't do that you get what you get
<fantasai> ChrisLilley: I think they're trying to push for "use proper lang tags"
<fantasai> florian: Specced to require markup lang tagging as well
<fantasai> florian: You have to match on both sides. If you ask for "auto-break Japanese", it will only apply that to elements that are also tagged as Japanese
<fantasai> florian: Browsers can maybe guess, but ...
<fantasai> ChrisLilley: Has this been articulated to the i18nwg?
<fantasai> florian: I can write it up in the issue
<fantasai> emilio: It's weird to make syntax parse depend on something that may be system dependent
<Rossen_> ack emilio
<fantasai> emilio: e.g. if you make system API to do the line break, then can only line-break if the system supports it
<fantasai> emilio: ...
<fantasai> emilio: It's a bit of a hassle to pass that back through the CSS parsing layer
<Rossen_> ack fremy
<fantasai> fremy: For the fallback, seems extremely dangerous that you would depend on this
<fantasai> fremy: it seems to be more prudent to say, if the UA is not able to find breaks in the text, then don't apply
<fantasai> fremy: doesn't mean you don't parse, but if you're supposed to break, and cannot find a place to break, sounds like a bug
<fantasai> florian: This property doesn't control line-breaking
<fantasai> florian: if you want to control line-breaking, you use the usual properties
<fantasai> florian: and you can put Unicode chars to indicate line breaks
<fantasai> florian: What this does is to effectively inject into the markup the Unicode characters you didn't put yourself
<fantasai> florian: We could change to make the property to autodetect and also control line-breaking, but already so complicated
<fantasai> florian: Also this property is not just for line-breaking,
<fantasai> florian: can detect word boundaries to insert spaces
<fantasai> florian: can insert spaces, or line break, or both
<fantasai> florian: If you don't have the characters in the markup and they fail to be auto-injected, then you won't wrap anywhere, and that's a big problem
<fremy> q?
<fantasai> florian: Maybe we can find a different way to do fallback, but if there's a risk of content overflowing instead of wrapping, people can't use this property
<fantasai> dbaron: Want to echo what Chris says, which is we shouldn't do too much based on this feedback until we make sure r12a understands why we did this in the first place
<fantasai> dbaron: it's not clear in the spec text, and not clear in the issue
<Rossen_> ack dbaron
<fantasai> dbaron: so go back and explain why you did it this way first
<fantasai> fantasai: Sounds like the plan is for Florian to go back and explain why this works the way it does in the issue, and also clarify in the spec
<fantasai> Rossen_: Anything else on this issue?
<dbaron> (r12a might still disagree, but we should find out first, and then perhaps discuss again)
<fremy> @ fantasai Thanks, this was a misconfiguration at my mike level, I had set gain to minimum ^_^
<fantasai> florian: I'll clarify

FremyCompany commented 2 years ago

(Question: would it be reasonable to enforce user agents who do not support a language to insert word boundaries chars everywhere a break is allowed normally, to use as a fallback?)

frivoal commented 2 years ago

We don't think the content author is able to guess what languages are supported by the user agent

Agreed. But to me, that supports the current design. Here's an example:

If you want to do "word" based line breaking for titles in Japanese, instead of the typical between-every-letter line breaking, assuming you have <wbr>s (or U+200B) in your markup, you can do this:

h1:lang(ja) {
  word-break: keep-all;
}

If you don't have <wbr>s (or U+200B) in your markup, and want to auto detect the word boundaries, you wouldn't want to merely do this:

h1:lang(ja) {
  word-boundary-detection: auto(ja);
  word-break: keep-all;
}

Because if the UA doesn't know how to do boundary detection in Japanese, the text will overflow, due to a lack of wrapping opportunities. So instead, what you'd do is something like this:

@supports (word-boundary-detection: auto(ja)) {
  h1:lang(ja) {
    word-boundary-detection: auto(ja);
    word-break: keep-all;
  }
}

But if we change the spec not to supply a language parameter to the word-boundary-detection property, you can no longer do that.

(Question: would it be reasonable to enforce user agents who do not support a language to insert word boundaries chars everywhere a break is allowed normally, to use as a fallback?)

If word-boundary-detection was exclusively for line breaking, I suppose that could work, but if you're using it with word-boundary-expansion, then that's the wrong fall-back. In that case, if word-boundary-detection doesn't work for the target language, you'd want to do no expansion, rather than expansion between (almost) every letter.
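
As a rough illustration of that second case, the sketch below pairs the draft `auto(ja)` syntax with `word-boundary-expansion`. The `p:lang(ja)` selector and the `space` value are assumptions chosen for illustration (`space` is believed to be one of the values in the draft at the time), and the block relies on the `auto()` syntax debated in this thread rather than on anything shipped.

```css
/* Sketch only: uses the draft syntax under discussion in this issue. */
@supports (word-boundary-detection: auto(ja)) {
  p:lang(ja) {
    /* Ask the UA to find virtual word boundaries in Japanese text. */
    word-boundary-detection: auto(ja);
    /* Assumed draft value: render each detected boundary as a space. */
    word-boundary-expansion: space;
  }
}
/* If the @supports test fails, no rule applies and no expansion happens,
   which is the fallback described above. */
```

With a "boundary everywhere a break is allowed" fallback instead, the same rule would put a space between (almost) every letter in a UA that lacks Japanese detection.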

litherum commented 1 year ago

I fully, 100%, absolutely agree with the OP here. In fact, before seeing this issue, I just sent an email to a colleague suggesting this, and describing why the current behavior doesn’t make any sense.

From my email:

The worst thing that can happen if the browser doesn’t have a dictionary for a particular language is that it falls back to the default behavior of boundary analysis (as-if word-boundary-detection wasn’t specified at all). …

I don’t know of any other text layout system which has this language based range behavior. In every other publishing system I’ve seen, this dictionary-based approach is either (a) automatically enabled and always on, or (b) an opt-in with a single boolean switch.

I honestly don’t understand the backwards-compatibility story described above in this thread. If a publishing house cares very much about exactly where their line breaks are, they won’t use this property, because different browsers and OSes will implement it differently. Therefore, it’s totally OK if this property has progressive enhancement; the author doesn’t know where the line breaks will be anyway - they are just telling the browser “do your best to improve the quality, possibly at the expense of performance.”

kojiishi commented 1 year ago

+1 to @litherum, doing this in the CSS value syntax doesn't look right to me either.

Blink is planning to implement this feature (Japanese natural line breaking) in Q3. Great if the WG can reconsider the syntax before that, but otherwise, we'll go with the current syntax.

litherum commented 1 year ago

We are implementing this now also, and would ideally like a resolution soonish.

frivoal commented 1 year ago

I honestly don’t understand the backwards-compatibility story described above in this thread. If a publishing house cares very much about exactly where their line breaks are, they won’t use this property, because different browsers and OSes will implement it differently. Therefore, it’s totally OK if this property has progressive enhancement; the author doesn’t know where the line breaks will be anyway - they are just telling the browser “do your best to improve the quality, possibly at the expense of performance.”

I think this is a sign that this should not be on a dedicated property, and that it does belong as a special value of word-break.

Indeed, as a special value of word-break, as you said, if the browser doesn't know how to do the detection for the given language, it falls back to normal line breaking, and that's fine.

If it is a value on a separate property with the behavior proposed for word-boundary-detection, there is a problem to be solved. I now think it is the wrong solution, but there was a problem: word-boundary-detection doesn't make words stay together on its own, it merely introduces <wbr> equivalents where the boundaries belong, and counts on the author separately turning on word-break: keep-all to get the correct line breaking. But if keep-all gets turned on for a piece of text where the browser doesn't know how to detect the boundaries, then instead of falling back to normal line breaking, you fall back to just keep-all, which means mostly no breaking at all.

There was a reason for having it as a separate property (the injected boundaries can be used for other purposes), but still, the awkwardness of this language parameter shows that this is likely the wrong design. Refactoring this into a new word-break value (and rearranging word-boundary-expansion to work differently) will work better.

litherum commented 1 year ago

Oh, I forgot to say earlier: we are hooking this up to CFStringTokenizer, which doesn’t have API (or SPI) that exposes the languages that have supported dictionaries. Nor should it (for the reasons described earlier in this thread). So this auto() function isn’t really implementable for us as-is.

this should not be on a dedicated property

ICU seems to have put this as one of the alternatives of word-break: https://unicode.org/reports/tr35/#:~:text=%22lw%22-,Line%20break%20word%20handling,-%22normal%22 maybe we should too? I don’t really have an opinion other than we should figure this out soon, because there are (at least) 2 active implementations.

kojiishi commented 1 year ago

+1 to make this a new value of word-break.

litherum commented 1 year ago

https://github.com/w3c/csswg-drafts/pull/8974 is a potential draft spec change that we could make here.

css-meeting-bot commented 1 year ago

The CSS Working Group just discussed language parameters for word-boundary-detection, and agreed to the following:

The full IRC log of that discussion:
<fantasai> Topic: language parameters for word-boundary-detection
<fantasai> github: https://github.com/w3c/csswg-drafts/issues/7193
<fantasai> myles: This is a topic about line-breaking
<fantasai> myles: we're implementing fancy line breaking, and I hear Chrome is doing the same
<fantasai> myles: interesting part is that it's based on words and phrases for CJK
<fantasai> myles: right now opt-in for CSS is word-boundary-detection with auto value
<fantasai> myles: auto value in CSS is actually a function that takes a locale string
<fantasai> myles: this issue is for removing the locale string
<fantasai> myles: 2 reasons we think it's good idea to remove
<fantasai> myles: 1st, we don't have ability to do this in our platform APIs, can't distinguish language
<fantasai> myles: 2nd is, if dictionaries aren't available for a language we fall back to normal rules, and that's fine, not a deal-breaker
<fantasai> myles: so turn it on for some languages and not others, doesn't help authors and doesn't help implementers
<Rossen_> ack flackr
<Rossen_> ack florian
<fantasai> florian: Doing something like this has been on my to-do list for a long time, so thanks for the push
<fantasai> florian: this is the direction I want to go in as well, and i18nWG as well
<fantasai> florian: as for specific PR, I haven't reviewed yet, and will do this week
<fantasai> florian: needs more work
<fantasai> florian: you extracted some bits to put into word-break, and that's fine, but leftover bits don't make sense
<iank_> From my understanding (i'd need to double check with Koji) I believe we support a new separate value for word-break.
<fantasai> florian: we might actually want to remove the rest of word-boundary-detection entirely
<fantasai> florian: and then there's some shared definitions if we're keeping it, and word-break with new value, so it needs more editorial adjustment
<fantasai> florian: but we're getting somewhere
<fantasai> florian: but I have a question, in the new PR
<fantasai> florian: I've heard argued both ways before
<fantasai> florian: so wondering what you had in mind
<fantasai> florian: You said in intro, "this is for phrase detection"
<fantasai> florian: but there was also suggestion of doing phrase grouping for languages like English, which do space separation
<fantasai> florian: this would e.g. group noun with its article
<fantasai> florian: Are you thinking MAY, MUST NOT, or SHOULD for such languages?
<fantasai> myles: first few topics you describe are editorial, don't need to discuss in WG
<fantasai> myles: last question, our linebreakers right now don't affect Latin scripts
<fantasai> myles: in the future we might want to add support
<fantasai> iank_: same here
<fantasai> florian: so even though this is on my back burner, I will be able to within the week
<fantasai> florian: so I certainly like to do this
<iank_> (I believe we are the same - but I might be wrong - and would have to double check with Koji).
<fantasai> florian: I think we probably also want to ask i18nWG for part you didn't touch
<fantasai> florian: for languages like Thai, effectively this is already baked in
<fantasai> florian: Thai doesn't use spaces, but it uses dictionary-based word detection to find word breaks
<fantasai> florian: that's by default
<fantasai> florian: but the word-boundary-detection had option to turn that off, in case authors wanted to do it manually
<fantasai> florian: maybe because they are e.g. writing a language that's not quite Thai
<fantasai> florian: so question for i18nWG is, do we want to preserve this ability? in which case we might need a keyword for that in word-break as well
<fantasai> Rossen_: Florian, can't tell if you're diverting resolution?
<fantasai> florian: It's going in the right direction, but not ready yet
<fantasai> myles: Proposal is to remove auto() function from word-boundary-detection and add keyword to word-break
<fantasai> florian: fully support
<fantasai> Rossen_: any additional comments or objections?
<fantasai> RESOLVED: remove auto() from word-boundary-detection, add keyword to word-break for this functionality

chrishtr commented 1 year ago

Agenda+ to resolve on the exact name for the word-break value. Three options to emoji-vote on:

The reason to consider variants of auto in the name is that phrase-based line breaking may not be available in the given language or platform.

css-meeting-bot commented 1 year ago

The CSS Working Group just discussed [css-text-4] Don't provide a language parameter for word-boundary-detection, and agreed to the following:

The full IRC log of that discussion:
<florian> q+
<TabAtkins> I'm weakly towards "phrase" just to suggest what it's actually doing.
<dael_> chrishtr: My understanding is we're all in agreement but need a name
<TabAtkins> and that keyword sounds reasonable from a complexity standpoint
<dael_> chrishtr: Three suggested are what I mentioned
<astearns> ack florian
<dael_> chrishtr: Reason to choose auto-phrase is because it could fallback if there isn't platform support
<dael_> florian: I think auto alone would not be good
<astearns> ack fantasai
<dael_> florian: I think auto alone would be a bad value because it already has a normal. auto-phrase is reasonable. I think keep-phrase would also be okay. I could go either way between those
<dael_> fantasai: Interacts a bit with property name for the whitespace text transform something which automatically inserts spaces at these points. A keyword that can be in both makes sense, which auto-phrase does. auto is too generic and keep-phrase doesn't work for word boundary
<dael_> florian: agree
<dael_> astearns: arguments for anything else?
<dael_> astearns: Objections to naming it auto-phrase
<dael_> RESOLVED: name it auto-phrase
<dael_> fantasai: As people impl this, one consideration we need to make sure is when you turn this on you don't end up triggering overflow b/c of the phrases. We'll want to avoid that, particularly if it's long or it's foreign and you can't understand. Normal should be to wrap at a boundary not overflow
<dael_> chrishtr: Makes sense
<dael_> astearns: Anything to resolve here?
<dael_> fantasai: Prop: If your phrase is too long you should break at a normal word boundary rather than overflow
<dael_> astearns: Comfortable with that?
<dael_> florian: Yes
<dael_> astearns: prop: add a principle saying you should break within the phrase instead of overflowing
<dael_> astearns: obj?
<dael_> RESOLVED: add a principle saying you should break within the phrase instead of overflowing
<fantasai> s/within the phrase/at normal word boundary/
<dael_> RESOLVED: If your phrase is too long you should break at a normal word boundary rather than overflow
<fantasai> cool
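
A minimal sketch of how the resolved keyword might be used, reusing the `h1:lang(ja)` selector from the examples earlier in the thread. The selector and the explicit `word-break: normal` line are illustrative assumptions, not part of the resolution.

```css
h1:lang(ja) {
  /* Baseline: ordinary Japanese line breaking. */
  word-break: normal;
  /* A UA that does not recognize auto-phrase discards this declaration,
     so the line above stays in effect. */
  word-break: auto-phrase;
}
```

Because an unrecognized value invalidates only its own declaration, no @supports wrapper is needed here, and no separate keep-all is involved that could suppress wrapping when detection is unavailable.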