CSS: adding support for "line-break:" and "word-break:"

koreader / crengine

This is the KOReader CREngine fork. It cross-pollinates with the official CoolReader repository at https://github.com/buggins/coolreader, in case you were looking for that one.

72 stars 45 forks source link

CSS: adding support for "line-break:" and "word-break:" #420

Closed poire-z closed 3 years ago

poire-z commented 3 years ago

Regarding https://www.mobileread.com/forums/showthread.php?t=337181

(@Frenzie) Actually there is line-break: anywhere, which while not currently supported is presumably fairly easy to implement should someone wish to do so.

https://developer.mozilla.org/en-US/docs/Web/CSS/line-break

Just forcing, after libunibreak has set flags, always: m_flags[pos-1] |= LCHAR_ALLOW_WRAP_AFTER would give:

It would need some added css parsing simple code and slots, and would waste 1 bit in a lUInt32 flags (so, nearly filling it, we might need to bump it to lUInt64 soon to store more flags - @NiLuJe : could that cause some performance issues on 32bits arches, doing some bitwise & checks ? Or is it negligeable ?).

But I must say it's really ugly and unreadable. I know it's just a CSS property that it might be nice to support, and users could get things however ugly they want via style tweaks... But isn't it our mission statement to get only nice rendering ? :)

NiLuJe commented 3 years ago

It would need some added css parsing simple code and slots, and would waste 1 bit in a lUInt32 flags (so, nearly filling it, we might need to bump it to lUInt64 soon to store more flags - @NiLuJe : could that cause some performance issues on 32bits arches, doing some bitwise & checks ? Or is it negligeable ?).

Bit ops should have a negligible impact ;).

But I must say it's really ugly and unreadable. I know it's just a CSS property that it might be nice to support, and users could get things however ugly they want via style tweaks... But isn't it our mission statement to get only nice rendering ? :)

It could possibly make (some) sense in some corner-cases, possibly centered around gigantic font sizes/margins? But, yeah, not particularly pretty/readable ;).

Frenzie commented 3 years ago

NB It's for CJK, not for us Westerners. (Except for the one person on MR, I suppose.)

On Sat, Feb 13, 2021, 19:49 NiLuJe notifications@github.com wrote:

It would need some added css parsing simple code and slots, and would waste 1 bit in a lUInt32 flags (so, nearly filling it, we might need to bump it to lUInt64 soon to store more flags - @NiLuJe https://github.com/NiLuJe : could that cause some performance issues on 32bits arches, doing some bitwise & checks ? Or is it negligeable ?).

Bit ops should have a negligible impact ;).

But I must say it's really ugly and unreadable. I know it's just a CSS property that it might be nice to support, and users could get things however ugly they want via style tweaks... But isn't it our mission statement to get only nice rendering ? :)

It could possibly make (some) sense in some corner-cases, possibly centered around gigantic font sizes/margins? But, yeah, not particularly pretty/readable ;).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/koreader/crengine/issues/420#issuecomment-778661295, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABRQBOYUHZVMK6CPGSZ42LS63CUXANCNFSM4XSMKZYA .

poire-z commented 3 years ago

Unrelated, but avoiding opening a new issue for a question.

My currently read book has some CSS properties like -epub-hyphens, -epub-text-align-last. that seem to have been a thing with an early version of EPUB3... http://idpf.org/epub/301/spec/epub-contentdocs.html#sec-css-text In this book, it is not duplicated as a normal hyphen or text-align-last property. Having them not supported and not applied degrades (a bit) the quality of the rendering.

I'm wondering if it would make sense, when a property does not match any of the normal ones we support, to try checking for a common prefix: -moz- -webkit- -epub- , remove that prefix, and redo the check with what we get. Note that we might then get duplicated properties - I don't think as we parse them we can notice they are duplicated (they are appended to a buffer as we parse).

Thoughts ?

Frenzie commented 3 years ago

I'm insufficiently familiar with -epub- (e.g., will the property values of -epub-hyphens actually match those for hyphens?), but note that both Opera and Mozilla felt necessitated to implement some select -webkit- prefixes. See here for quite the list. It could make sense to support those specific properties if we even support any of the non-prefixed ones, but I'd say not -webkit- in general, and not -moz- as a standard thing either. The main problem is those prefixed properties are often subtly different from the final spec.

-epub- is probably a bit different. We're putting together an EPUB viewer after all.

I'd also like to draw your attention to -prince- prefixed properties, for example https://www.princexml.com/doc/css-props/#prop-prince-footnote-policy and https://www.princexml.com/doc/css-props/#prop-prince-forced-breaks

I don't necessarily mean that in the sense of supporting -prince- directly, but some of them might serve as inspiration for -cre- prefixes that are more relevant to the way we render things than many of the properties supported by browsers.

Note that we might then get duplicated properties - I don't think as we parse them we can notice they are duplicated (they are appended to a buffer as we parse).

Because they're prefixed? If they're not treated in the right cascading order I'd rather ignore the prefixed ones .

poire-z commented 3 years ago

The main problem is those prefixed properties are often subtly different from the final spec.

The idea is to just expect they match keywords: if they match some that we know for the normal property, assume it means the same thing. If they have values that are not known keywords for the normal property, just ignore the value as for any invalid/unsupported keyword.

I don't want to have to maintain a list of possible prefixed values. Assume that if there's never been a -webkit-font-weight:, we shouldn't meet one and it doesn't hurt checking - and if we meet one, some author invented it for that purpose thinking it existed, so parse it and handle it.

I could go with:

     while ( *decl && *decl != '}' ) {
         skip_spaces( decl );
         css_decl_code prop_code = parse_property_name( decl );
+        if (prop_code == cssd_unknown) {
+            // Check for common prefixes, skipping them if matched, and looking
+            // again for a known property
+            if (substr_icompare( "-epub-", decl, true )) { // this, it matching, advances the pointer 'decl' till after the match
+                prop_code = parse_property_name( decl );
+            }
+            else if (substr_icompare( "-webkit-", decl, true )) {
+                prop_code = parse_property_name( decl );
+            }
+            else if (substr_icompare( "-moz-", decl, true )) {
+                prop_code = parse_property_name( decl );
+            }
+        }

Because they're prefixed? If they're not treated in the right cascading order I'd rather ignore the prefixed ones .

They would be.

.auteur {
        margin : 0% 0% 1% 0%;
        text-indent : 0%;
        text-align : right;
        -webkit-hyphens : none;
        -moz-hyphens : none;
        -ms-hyphens : none;
        -o-hyphens : none;
        hyphens : none;
        text-rendering : optimizeSpeed;
 }

would be treated as:

.auteur {
        margin : 0% 0% 1% 0%;
        text-indent : 0%;
        text-align : right;
        hyphens : none;
        hyphens : none;
        hyphens : none;
        hyphens : none;
        hyphens : none;
        text-rendering : optimizeSpeed;
 }

so, storing 5 times the same hyphens : none in the (serialized) declaration, and applying them in the order met: the last one wins.

Ok, in that sample, they are all here, including the normal one. But in very old EPUBs, we might meet:

.auteur {
        margin : 0% 0% 1% 0%;
        text-indent : 0%;
        text-align : right;
        -webkit-hyphens : none;
        -moz-hyphens : none;
 }

In my current EPUB, I have:

p._2_CORPS_ChapTitre_chap {
        -epub-hyphens:none;
        color:#636363;
        font-family: serif;
}
p.ParaOverride-2 {
        -epub-text-align-last:justify;
        text-align:justify;
}

I'm happy to get the first one working (but totally unhappy about the 2nd one ! which is ridiculous and probably not intended...)

Frenzie commented 3 years ago

some author invented it for that purpose thinking it existed, so parse it and handle it.

But while testing they won't have noticed that -webkit-font-size: 100em completely destroys the intended display because it was simply ignored by the rendering engine. ;-) You need a hand-tailored list that maps to supported properties. We don't have to do that tailoring ourselves because Mozilla and Opera already did it. Mozilla referenced above, Opera here. I think the tl;dr is probably nothing we actually support at all.

so, storing 5 times the same hyphens : none in the (serialized) declaration, and applying them in the order met: the last one wins.

Okay, that's good. That's how it's intended to work. :-D

I'm happy to get the first one working (but totally unhappy about the 2nd one ! which is ridiculous and probably not intended...)

I'm not sure I understand what you mean. That is the correct way to do it. Same as in your .auteur. If you only use the prefixed version, you know for sure it won't work when the prefix isn't supported. If you always include the unprefixed version, you run a slight risk it won't work but it probably will (unless you're really, really early and the spec hasn't been adjusted based on implementations yet). By including the unprefixed version, at worst it won't work as if it wasn't there, and with any luck it'll Just Work™ a few years down the line. I literally had an experiment from 2002 that went something like:

.auteur {
 -moz-border-radius: 1em;
 border-radius: 1em;
}

It took more than 8 years, but it came out swinging.

On the flip-side, in Opera from the same era you could do something like this:

img[title]::after {display:block; margin-top:1em; content:attr(title)}

That won't work in anything other than Opera/Presto. (Unless it works in Prince.) And that was fully per-spec, no prefixes necessary.

poire-z commented 3 years ago

because it was simply ignored by the rendering engine. ;-)

Alright :/ So may be we should just enumerate them, and handle them as we do for the various hyphenation ones: https://github.com/koreader/crengine/blob/bc2b31e6b37fdb7eca78b6b96dde181ad0ca6f72/crengine/src/lvstsheet.cpp#L36-L44 https://github.com/koreader/crengine/blob/bc2b31e6b37fdb7eca78b6b96dde181ad0ca6f72/crengine/src/lvstsheet.cpp#L134-L142

I think the tl;dr is probably nothing we actually support at all.

There are a few:

-webkit-background-size
-webkit-font-variant-ligatures
-webkit-hyphens
-webkit-line-break (this issue)
-epub-text-decoration

Plus the epub ones from the specs above:

    -epub-hyphens
    -epub-line-break
    -epub-text-align-last

I'm happy to get the first one working (but totally unhappy about the 2nd one ! which is ridiculous and probably not intended...)

I'm not sure I understand what you mean

Not a technical unhappiness, a end user one :) By the second one, I meant the whole:

p.ParaOverride-2 {
        -epub-text-align-last:justify; /* and specifically this one */
        text-align:justify; /* this one is fine */
}

I just mean that as the end user reading, I was happy to get -epub-hyphens: none supported - but I was happier before -epub-text-align-last: justify; was supported, because it's probably a publisher error to have marked a few paragraph with this p.ParaOverride-2 { text-align-last:justify; }. I don't think this publisher wanted his book to look like this :) But ok, we would render it as the CSS tells us to do.

poire-z commented 3 years ago

I have implemented support for line-break: anywhere, but now, I'm tempted to properly handle loose/normal/strict so we are done with it. And while at it, let's also handle word-break: normal/keep-all/break-all - break-all might be a better solution for this mobiread user, as it may keep more correct wrapping around punctuations (but I guess that at this point of readability degradation, who cares :) But this might be more of interest to Korean and Ethiopian readers, where breaking by space only, or on each ideograph, are 2 valid typographical choices.

Some links: https://florian.rivoal.net/talks/line-breaking/ https://developer.mozilla.org/en-US/docs/Web/CSS/word-break https://drafts.csswg.org/css-text-3/#word-break-property https://drafts.csswg.org/css-text-3/#line-break-property

Implementation in Rust of UAX14, including tweaks by these CSS properties: https://github.com/makotokato/uax14_rs

How WebKit implemented line-break: https://bugs.webkit.org/show_bug.cgi?id=89235 https://trac.webkit.org/changeset/176473/webkit https://trac.webkit.org/wiki/LineBreakingCSS3Mapping

How Mozilla implemented line-break: https://bugzilla.mozilla.org/show_bug.cgi?id=1011369 https://bugzilla.mozilla.org/show_bug.cgi?id=1531715 https://hg.mozilla.org/integration/autoland/rev/436e3199c386 https://hg.mozilla.org/releases/mozilla-esr78/file/tip/intl/lwbrk/LineBreaker.cpp

Mozilla doesn't use UAX14, but keep using some classification build from some Japanese specs.

I won't be able to judge the quality of any tweak, but their code includes some hex chars that need tweaking when loose/normal/strict, so I might just pick them and try to guess what's the right UAX14 class to substitute them with. In textlang.cpp, we have some lb_char_sub_func that can already substitute chars per language exactly "for the purpose of line breaking", so these might be easy to do:

break-all Breaking is allowed within “words”: specifically, in addition to soft wrap opportunities allowed for normal, any typographic letter units (and any typographic character units resolving to the NU (“numeric”), AL (“alphabetic”), or SA (“Southeast Asian”) line breaking classes [UAX14]) are instead treated as ID (“ideographic characters”) for the purpose of line-breaking.

keep-all Breaking is forbidden within “words”: implicit soft wrap opportunities between typographic letter units (or other typographic character units belonging to the NU, AL, AI, or ID Unicode line breaking classes [UAX14]) are suppressed, i.e. breaks are prohibited between pairs of such characters (regardless of line-break settings other than anywhere) except where opportunities exist due to dictionary-based breaking. Otherwise this option is equivalent to normal. In this style, sequences of CJK characters do not break.

Frenzie commented 3 years ago

Wow, talk about above and beyond. :-D

Since you're doing a deep dive anyway, I might as well casually mention overflow-wrap / word-break: break-word. (Or was that already supported… it's hard to keep track with the speed of development sometimes. :blush: )

https://developer.mozilla.org/en-US/docs/Web/CSS/overflow-wrap https://www.princexml.com/doc/css-props/#prop-overflow-wrap

https://developer.mozilla.org/en-US/docs/Web/CSS/word-break https://www.princexml.com/doc/css-props/#prop-word-break

https://www.princexml.com/doc/css-props/#prop-line-stacking-strategy

https://www.princexml.com/doc/11/properties/prince-wrap-inside/

poire-z commented 3 years ago

I believe word-break: break-word:

For compatibility with legacy content, the word-break property also supports a deprecated break-word keyword. When specified, this has the same effect as word-break: normal and overflow-wrap: anywhere, regardless of the actual value of the overflow-wrap property.

is the default behaviour of crengine (with possibly overflow-wrap: break-word instead of anywhere, as we indeed do this: soft wrap opportunities introduced by the word break are NOT considered when calculating min-content intrinsic sizes

We never let text overflow its container:

we break on allowed line breaks (normal UAX14, possibly tweaked by word-break or line-break soon)
if we did break and there's too much space left, we try hyphenation if allowed
if we did not break: if the single word does not overflow, we let it be, left aligned - if the single word overflows, we break it anywhere so the left part fits (dunno if we try hyphenation in this case, possibly yes - so, if possible hyphenation before the overflow, we might use it instead of breaking that word anywhere).

This is ensure by some normal code in many places, so I'd rather not have to tweak it :) (and not have to handle overflow-wrap: anywhere (Soft wrap opportunities introduced by the word break are considered when calculating min-content intrinsic sizes) in getRenderedWidths().

So, I'll skip overflow-wrap: normal/anywhere :)

Never heard of line-stacking-strategy - but:

The value block-line-height uses the line-height of the block element and ignores the actual height of the content on those lines, so lines will always have the same spacing regardless of whether they contain spans with larger font size of superscripts/subscripts.

feels very like our own -cr-hint: strut-confined (Prevent inline content like sub- and superscript from changing their paragraph line height). The other values could not be too hard (ignore the strut, or scale to a multiple of the strut) but feels less useful. But well :) If at the time I had known about that property, and if at the time we had already found a solution to the 64 bits/slots limit for styles, I would probably have gone at implementing the much needed strut-confinement via line-stacking-strategy: block-line-height.

About prince-wrap-inside: auto | phrase : painful enough to deal with lines and words, I won't additionally deal with sentences ! ;)

Frenzie commented 3 years ago

Thanks for taking it seriously. 👌