w3c / csswg-drafts

CSS Working Group Editor Drafts
https://drafts.csswg.org/

[css-text-4] Bikeshedding word-boundary-expansion #7385

Open fantasai opened 2 years ago

fantasai commented 2 years ago

word-boundary-expansion is currently about expanding spaces in CJK, but it could be used more generically, e.g. to swap between spaces and Ethiopic spaces in Ethiopic. Can we rename this property to work better for other use cases?

Proposed name: word-boundary-transform, which hooks in nicely with text-transform.

(I also am not a huge fan of the word-boundary part but I don't have a great idea. Maybe word-space-transform? word-splitter-transform? Idk.)

fantasai commented 1 year ago

Some notes on how to expand this out to accommodate additional space types:

none | [ zero-width | space | ideographic-space | ethiopic-space ]{1,2}#

A first level could have only the 1-keyword version.
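
To make the proposed grammar concrete, here is a minimal Python sketch that only checks whether a value matches `none | [ ... ]{1,2}#`, reading `#` as a comma-separated list and `{1,2}` as one or two keywords per list item, as in CSS value definition syntax. The function name `is_valid_value` is hypothetical, and nothing here says what the one-keyword and two-keyword forms would mean, since the notes above do not define that.

```python
# Keywords taken from the grammar proposed above.
KEYWORDS = {"zero-width", "space", "ideographic-space", "ethiopic-space"}


def is_valid_value(value: str) -> bool:
    """Hypothetical check of a value against: none | [ <keyword> ]{1,2}#"""
    # CSS keywords are ASCII case-insensitive, so compare in lower case.
    value = value.strip().lower()
    if value == "none":
        return True
    # '#' means a comma-separated list with at least one item.
    for item in value.split(","):
        words = item.split()
        # '{1,2}' means each item holds one or two keywords.
        if not 1 <= len(words) <= 2:
            return False
        if any(word not in KEYWORDS for word in words):
            return False
    return True
```

Under this reading, `ideographic-space` and `space ethiopic-space, zero-width` are accepted, while `none, space` and `space space space` are rejected. A first level restricted to the 1-keyword version would tighten the per-item check to exactly one keyword.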

fantasai commented 1 year ago

Florian notes that maybe interpunct belongs in this list also. https://en.wikipedia.org/wiki/Interpunct

frivoal commented 1 year ago

I have looked into interpuncts, and changed my mind about them.

It is tempting to support conversions to and from interpunct as well, primarily for the sake of modern vs. ancient renditions of Latin text: usage of the Latin script shifted from word separation with interpuncts in the classical Roman period, to no delimiter in late antiquity, to space-separated words in the early Middle Ages, to space-separated words with punctuation around the Renaissance, with a number of variants along the way. However, switching from one style to another involves more than just swapping one type of space for another: it would also require punctuation transformation and a few other things. For instance, here is a modern rendition of the beginning of Res Gestae Divi Augusti, followed by a classical one.
Annos undeviginti natus exercitum privato consilio et privata impensa comparavi, per quem rem publicam dominatione factionis oppressam in libertatem vindicavi. Quas ob res senatus decretis honorificis in ordinem suum me adlegit C. Pansa A. Hirtio consulibus, consula rem locum sententiae dicendae simul dans, et imperium mihi dedit.
ANNOS·​VNDEVIGINTI·​NATVS·​EXERCITVM·​PRIVATO·​CONSILIO·​ET·​PRIVATA·​IMPENSA·​COMPARAVI·​PER·​QVEM·​REM·​PVBLICAM·​DOMINATIONE·​FACTIONIS·​OPPRESSAM·​IN·​LIBERTATEM·​VINDICAVI· QUAS·​OB·​RES·​SENATVS·​DECRETIS·​HONORIFICIS·​IN·​ORDINEM·​SVVM·​ME·​ADLEGIT·​C·PANSA·​A·HIRTIO·​CONSVLIBVS·​CONSVLA·​REM·​LOCVM·​SENTENTIAE·​DICENDAE·​SIMVL·​DANS·​ET·​IMPERIVM·​MIHI·​DEDIT·
This involves transforming:

* lone spaces into interpunct+zero-width-space
* comma+space into interpunct+zero-width-space
* period+space into interpunct+space (or interpunct+zero-width-space, depending on style)
* period+NBSP into interpunct (or interpunct+zero-width-space, depending on style)
* not shown in this example, but ideally trailing interpuncts at the end of a line should be removed (and possibly `word-break: break-all` should be applied, depending on style)
* lower case to upper case (which can be handled with `text-transform`), alongside u to V and j to I (which `text-transform` theoretically could handle, but doesn't)
* not shown in this example, but if the text had been written to indicate long vowels, transforming from modern to classical would also involve transforming macrons to apices, except for ī, which maps to ꟾ (U+A7FE), so the first two words `Annōs ūndēvīgintī` become `ANNÓS·​V́NDÉVꟾGINTꟾ`. That too could fit in `text-transform` in theory.

The parts that don't fit in `text-transform` seem beyond the reasonable scope of this property, and the precise rules might even need to be fine-tuned for the particular content and styles in question, making it impractical to provide a generic built-in transform. Without doing all of it, you're not switching from one legitimate style to a different legitimate style, and it's unlikely anyone would want it. Interpunct in other languages is typically used for different purposes, so if it cannot be done for Latin, it's not worth doing at all.
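
To give a sense of how much even the simple part involves, here is a rough Python sketch of the separator and case rules listed above, for one particular style (period+space becomes interpunct+space, period+NBSP becomes a bare interpunct). The function name `to_classical` is hypothetical. The sketch deliberately leaves out the layout-dependent parts (dropping trailing interpuncts at line ends, `word-break`) and the macron-to-apex mapping, and it does not handle a period with nothing after it.

```python
import re

INTERPUNCT = "\u00b7"  # MIDDLE DOT
ZWSP = "\u200b"        # ZERO WIDTH SPACE
NBSP = "\u00a0"        # NO-BREAK SPACE

# Separator rewrites for one assumed style; other styles would differ.
_RULES = {
    "." + NBSP: INTERPUNCT,          # period+NBSP  -> interpunct
    ". ": INTERPUNCT + " ",          # period+space -> interpunct+space
    ", ": INTERPUNCT + ZWSP,         # comma+space  -> interpunct+ZWSP
    " ": INTERPUNCT + ZWSP,          # lone space   -> interpunct+ZWSP
}

# A single alternation so every separator is rewritten exactly once,
# and a space produced by one rule is never re-matched by another.
_PATTERN = re.compile("|".join(re.escape(key) for key in _RULES))


def to_classical(text: str) -> str:
    """Hypothetical modern-to-classical transform (partial, one style only)."""
    text = _PATTERN.sub(lambda match: _RULES[match.group(0)], text)
    # Case folding that text-transform could cover, plus U -> V and J -> I.
    return text.upper().replace("U", "V").replace("J", "I")
```

For example, `to_classical("Annos undeviginti natus")` yields the three words in capitals with V for U, joined by interpunct+zero-width-space pairs. The rules that cannot be expressed this way are exactly the ones that depend on line layout or on the conventions of a particular edition.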

TLDR: Transformations from zwsp or space to interpunct, or the other way around, would be excessively complex, impractical to use, or both. Even though I was tempted, I think we should not attempt them here.