koreader / crengine

This is the KOReader CREngine fork. It cross-pollinates with the official CoolReader repository at https://github.com/buggins/coolreader, in case you were looking for that one.

Enhanced text layout: links, thoughts and discussion #307

Open poire-z opened 5 years ago

poire-z commented 5 years ago

I've been looking at implementing some alternative hanging punctuation code as envisioned in https://github.com/koreader/koreader/issues/2844#issuecomment-464483142.

I figured I may need, in lvtextfm.cpp, some alternative methods for laying out lines and spacing words (but just that, not redesigning the whole thing!), and so I began looking at other topics like line breaking, bidirectional text, and proper CJK text layout, at least to see how that could fit in and to avoid taking early wrong directions that would forbid working on these additional features later.

Sadly, I know nothing about languages and writing systems other than western ones... So I have tons of questions for CJK and RTL readers, which I may ask later in this issue, if there are some willing to help with that :) I have no personal use for all this, as I only read western scripts, but these are quite interesting topics :) I sometimes think that we could get these right simply by using the appropriate thirdparty libraries. But at other moments, I feel that even the libraries won't do all of it correctly, and there may be much manual tweaking needed, possibly per language... So, it ends up feeling like opening a can of worms...

Terminology:
CJK = Chinese, Japanese and Korean
RTL = Right To Left (Arabic, Persian, Hebrew... scripts)
LTR = Left To Right (Latin, western languages, CJK...)
Bidi = Bidirectional text (LTR and RTL mixed)
For now, I'm just cutting, pasting and organizing my accumulation of URLs and thoughts.

Unicode text layout references and algorithms:

http://www.unicode.org/reports/tr14/ UAX#14: Unicode Line Breaking Algorithm
http://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt reference data file
http://jkorpela.fi/unicode/linebr.html Unicode line breaking rules: explanations and criticism
https://www.unicode.org/reports/tr29/ UAX#29: Unicode Text Segmentation
http://www.unicode.org/reports/tr9/ UAX#9: Unicode Bidirectional Algorithm
http://www.unicode.org/reports/tr11/ UAX#11: East Asian Width
https://www.w3.org/TR/jlreq/ Requirements for Japanese Text Layout
https://www.w3.org/TR/clreq/ Requirements for Chinese Text Layout
https://w3c.github.io/typography/ International text layout and typography index (links)
https://unicode.org/cldr/utility/breaks.jsp Unicode Utilities (to test the algorithms' output)

https://drafts.csswg.org/css-text-3/ CSS take on all that enhanced typography; appendices D, E and F give some insight about writing systems and the importance of the lang= attribute

Sites with valuable information about foreign scripts, languages, typography and characters

https://r12a.github.io/scripts/ Wonderful and complete descriptions of each script, usage, layout
https://r12a.github.io/scripts/phrases Sample phrases in various scripts
https://r12a.github.io/scripts/tutorial/summaries/wrapping Sample phrases for testing wrapping
http://www.alanwood.net/unicode/index.html Dated, but very complete
http://jkorpela.fi/chars/index.html Characters and encodings
http://jkorpela.fi/chars/spaces.html
http://jkorpela.fi/dashes.html
https://jrgraphix.net/research/unicode.php Unicode Character Ranges
https://unicode.org/charts/
http://unifoundry.com/unifont/index.html large single image of the full Unicode planes

Line breaking & justification

https://en.wikipedia.org/wiki/Line_wrap_and_word_wrap
https://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages
https://www.w3.org/International/articles/css3-text/ CSS and International Text (line breaking and text alignment)
https://www.w3.org/International/articles/typography/justification Approaches to full justification
http://w3c.github.io/i18n-drafts/articles/typography/linebreak.en Approaches to line breaking
https://www.w3.org/TR/2003/CR-css3-text-20030514/#justification-prop CSS justification options, describing the various ways to justify appropriately for some scripts

https://github.com/bramstein/typeset/ TeX line breaking algorithm in JavaScript
https://wiki.mozilla.org/Gecko:Line_Breaking Mozilla documentation about line breaking (obsolete? it mentions it should switch to UAX#14), implemented in https://github.com/mozilla-services/services-central-legacy/blob/master/intl/lwbrk/src/nsJISx4501LineBreaker.cpp

Hanging punctuation / Optical margin alignment

https://en.wikipedia.org/wiki/Hanging_punctuation
https://en.wikipedia.org/wiki/Optical_margin_alignment
https://askfrance.me/q/comment-bien-choisir-saillie-pour-les-lettres-et-la-ponctuation-hors-36070225344
https://helpx.adobe.com/fr/photoshop/using/formatting-paragraphs.html#specify_hanging_punctuation_for_roman_fonts
https://french.stackexchange.com/questions/1432/whats-hanging-punctuation-in-french
https://drafts.csswg.org/css-text/#hanging
https://www.w3.org/TR/css-text-3/#hanging-punctuation-property There is support in CSS, but it's very limited and targeted at CJK

Relevant commit about its implementation in crengine: 3ffe69441 (extended to other ideograph by 81bbb8d6c 59377ba89).

I figured we could handle both CJK hanging punctuation and western optical margin alignment the same way, by pushing, for each candidate glyph, a percentage of its width into the margin. CJK hanging punctuation can go fully into the margin, because the fixed-width ideogram glyphs have a good amount of blank space, so in the end the space taken in the margin is smaller than the ideogram width. For western non-fixed-width glyphs (punctuation), we would use a smaller percentage. Some suggestions and discussions at:
https://www.w3.org/Mail/flatten/index?subject=Amending+hanging-punctuation+for+Western+typography&list=www-style
https://source.contextgarden.net/tex/context/base/mkiv/font-imp-quality.lua hanging punctuation percentage by char
https://lists.w3.org/Archives/Public/www-style/2011Apr/0276.html
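To illustrate the idea (this is a sketch, not crengine code): a hypothetical per-character table of hang percentages, with illustrative values only, from which we derive how much of a line-final glyph's advance may protrude into the margin:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-character "hang percentage" table (values are illustrative
// only; ConTeXt's font-imp-quality.lua ships a comparable per-char table).
static int hang_percent(uint32_t ch) {
    switch (ch) {
        case '.': case ',':  return 70;  // small western punctuation
        case '-':            return 50;
        case 0x2019:         return 60;  // RIGHT SINGLE QUOTATION MARK
        case 0x3001:                     // IDEOGRAPHIC COMMA
        case 0x3002:         return 100; // IDEOGRAPHIC FULL STOP: hang fully
        default:             return 0;   // everything else stays inside
    }
}

// Width (same units as the glyph advance) that a line-final glyph of
// advance `adv` would be allowed to push into the margin.
static int hang_width(uint32_t ch, int adv) {
    return adv * hang_percent(ch) / 100;
}
```

A single table like this covers both cases: CJK punctuation hangs at 100% of its (mostly blank) fixed advance, western punctuation at some smaller fraction.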

BIDI / RTL:

https://www.w3.org/International/articlelist#direction
https://www.w3.org/International/questions/qa-html-dir Q/A
https://www.w3.org/International/articles/inline-bidi-markup/uba-basics
https://www.w3.org/International/articles/inline-bidi-markup/index.en for inline elements
https://www.w3.org/International/questions/qa-html-dir.en for block elements
http://www.i18nguy.com/markup/right-to-left.html
https://www.mobileread.com/forums/showpost.php?p=3828770&postcount=406 sample-persian-book.epub with screenshots of the expected result

Unrelated to crengine, but to check if we want to make the UI RTL:   https://labs.spotify.com/2019/04/15/right-to-left-the-mirror-world/   https://material.io/design/usability/bidirectionality.html UI

Various articles about the text layout process

https://www.unicodeconference.org/presentations/S5T2-Röttsches-Esfahbod.pdf Text rendering in Chrome (by the HarfBuzz author)
https://simoncozens.github.io/fonts-and-layout/ An (unfinished) book about text layout
http://litherum.blogspot.com/2015/02/end-to-end-tour-of-text-rendering.html
http://litherum.blogspot.com/2013/11/complex-text-handling-in-webkit-part-1.html Encoding
http://litherum.blogspot.com/2013/11/complex-text-handling-in-webkit-part-2.html Fonts
http://litherum.blogspot.com/2014/02/complex-text-handling-in-webkit-part-3.html Codepoint to Glyph
http://litherum.blogspot.com/2014/04/complex-text-handling-in-webkit-part-3.html Line breaking
http://litherum.blogspot.com/2014/11/complex-text-handling-in-webkit-part-5.html Bidi
http://litherum.blogspot.com/2014/11/complex-text-handling-in-webkit-part-5_22.html Run Layout
http://litherum.blogspot.com/2014/11/complex-text-handling-in-webkit-part-7.html Width Calculations
http://litherum.blogspot.com/2015/04/complex-glyph-positioning.html
http://litherum.blogspot.com/2015/07/knuth-plass-line-breaking-algorithm.html
http://litherum.blogspot.com/2015/10/vertical-text.html
http://litherum.blogspot.com/2017/05/relationship-between-glyphs-and-code.html

Available libraries that could help with that

For illustration, there is a Lua module that provides the full text rendering stack and uses many of these libraries. It is interesting to look at (as it's readable :) and may be the only small complete full stack I found; it shows the order in which things should be done.
https://luapower.com/tr unibreak, fribidi in Lua
https://github.com/fribidi/fribidi/issues/30 interesting Q/A between the author and the HarfBuzz people
https://github.com/luapower/tr/blob/master/tr_research.txt some short notes on these same topics

There is also this Lua layout engine, which has "just enough" wrappers, and has many specific tweaks per language: https://github.com/simoncozens/sile/ (see justenoughharfbuzz.c, languages/fr.lua...)   https://github.com/Yoxem/sile/commits/master w.i.p. chinese zh.lua adapted from ja.lua https://github.com/michal-h21/luatex-harfbuzz-shaper

utf8proc

https://github.com/JuliaStrings/utf8proc http://juliastrings.github.io/utf8proc/doc/ Provides helpers for Unicode categorization (but it's a bit limited, as it does not provide all the properties, like the Unicode script - so we can't use it to detect whether some char is Chinese or Korean).
https://stackoverflow.com/questions/9868792/find-out-the-unicode-script-of-a-character the gist in the 1st answer gives a simple implementation for detecting the script

harfbuzz

https://github.com/harfbuzz/harfbuzz We already use it for font shaping in kerning "best" mode. It can also provide useful things like direction and script detection of what we throw at it (https://harfbuzz.github.io/harfbuzz-hb-buffer.html), so it may complement utf8proc for some Unicode categorisation. (It includes UCDN https://harfbuzz.github.io/utilities-ucdn.html, so we get additional functions for free.)

libunibreak

https://github.com/adah1972/libunibreak implements UAX#14 and UAX#29
https://luapower.com/libunibreak https://github.com/luapower/libunibreak Lua wrapper
https://github.com/HOST-Oman/libraqm/pull/76 open PR to use libunibreak in libraqm
https://github.com/adah1972/libunibreak/issues/16 word breaks are less obvious

This works only on the text nodes in logical order, and could be used in crengine src/lvtextfm.cpp copyText() to set/unset LCHAR_ALLOW_WRAP_AFTER, trusting it and removing our explicit check for isCJKIdeograph() in processParagraph() and other places.

I initially thought our check for isCJKIdeograph() was wrong, as it allows breaks after any Korean glyph (which are like syllables), while Korean has words separated by spaces, so we should break on spaces like in western scripts. But it looks like Korean, even though it has spaced words, allows a line break in the middle of such a word. So we're probably already fine with Korean.

libunibreak accepts a language parameter, but it's only used to add a few line breaking rules specific to that language, mostly related to quotes (list in https://github.com/adah1972/libunibreak/blob/master/src/linebreakdef.c). So, I discovered that German strangely closes on left angle quotation marks, and opens on right angle quotation marks :) (so I guess what I put in #237 might give strange results on German text, unless Germans don't use spaces on both sides, and only French does that). Anyway, I'd like us not to have to detect the document or text segment language, nor to have it provided by the frontend, to keep things simple. Dunno if that's a viable wish.

Some discussion about reshaping because of line breaks, and some unsafe_to_break flag that could/should complement our "is_ligature_tail" flag:   https://github.com/harfbuzz/harfbuzz/issues/1463#issuecomment-505592189   https://github.com/linebender/skribo/issues/4

We may also need to pass HB_BUFFER_FLAG_BOT / HB_BUFFER_FLAG_EOT to HarfBuzz for specific shaping at Begin/End Of Text (=paragraph).

Other code using libunibreak: https://github.com/geometer/FBReader/blob/master/zlibrary/text/src/area/ZLTextParagraphBuilder.cpp FBReader https://git.enlightenment.org/core/efl.git/tree/src/lib/evas/canvas/evas_object_textblock.c enlightenment

Note: when a word is followed by multiple spaces, libunibreak sets the allowed break on the last space - crengine will want it on the first space; the others should be marked as collapsed spaces and should go at the beginning of the next word, where they will be ignored.
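A minimal sketch of that post-processing step, with hypothetical flag names (crengine's real LCHAR_* flags differ):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flag bits for this sketch (not crengine's actual values).
enum : uint8_t { ALLOW_WRAP_AFTER = 1, COLLAPSED = 2 };

// libunibreak flags the *last* space of a run as the break opportunity;
// move it to the *first* space and mark the following spaces as collapsed,
// so they travel with the start of the next word and get ignored there.
static void shift_break_to_first_space(const std::u32string &text,
                                       std::vector<uint8_t> &flags) {
    for (size_t i = 0; i < text.size(); ) {
        if (text[i] != U' ') {
            i++;
            continue;
        }
        size_t first = i;
        while (i < text.size() && text[i] == U' ')
            i++;
        size_t last = i - 1;
        if (last > first && (flags[last] & ALLOW_WRAP_AFTER)) {
            flags[last] &= ~ALLOW_WRAP_AFTER;   // clear break on last space
            flags[first] |= ALLOW_WRAP_AFTER;   // move it to the first space
            for (size_t j = first + 1; j <= last; j++)
                flags[j] |= COLLAPSED;          // the rest collapse away
        }
    }
}
```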

fribidi

https://github.com/fribidi/fribidi fribidi (implements UAX#9)
This works only on the text buffer in logical order, and fills another buffer (lUint32, so as large as the text buffer) from which we can get the bidi level of each char (needed because English can be detected to be embedded in some Arabic which is itself part of some English paragraph...). It could be used in crengine src/lvtextfm.cpp copyText() to set that level on each char. We would then need, in measureText(), to split on bidi level changes to get a new text segment to measure (like we do when there is a font change), as a single text node can have both Latin and Hebrew in it, and HarfBuzz expects its buffer to have a single direction and script.

After that, I guess, line breaking should be tweaked (in processParagraph(), and maybe in addLine()): when processing a text line in logical order and splitting words, it should re-order the words according to the bidi level of their origin text segment... We've seen that HarfBuzz already reverses each individual RTL word and renders it correctly (though not so nicely the way we currently use it, see below). And according to https://github.com/fribidi/fribidi/issues/30, there's quite a bit less to do about bidi when we use HarfBuzz. So, it looks to me that we should indeed split lines using the logical text order, and as HarfBuzz renders an RTL word correctly, we just have to re-order the words. https://github.com/fribidi/linear-reorder/blob/master/linear-reorder.c provides a generic algo. It looks like we could put a crengine formatted_word_t in that run_t to have them re-ordered. Dunno if it's as simple as that :) (After that, there may be even more complicated things to get text selection and highlighting working with bidi and RTL...)
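For reference, the core of that reordering is UAX#9's rule L2; a standalone sketch (simplified, batch-reordering whole runs rather than maintaining linear-reorder.c's incremental linked list) could look like:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct Run {
    int level;        // bidi embedding level (even = LTR, odd = RTL)
    std::string text; // placeholder for e.g. a crengine formatted_word_t
};

// UAX#9 rule L2: from the highest embedding level down to the lowest odd
// level, reverse every maximal sequence of runs at or above that level.
static std::vector<Run> reorder_runs(std::vector<Run> runs) {
    int max_level = 0, min_odd = 255;
    for (const Run &r : runs) {
        max_level = std::max(max_level, r.level);
        if (r.level % 2 == 1)
            min_odd = std::min(min_odd, r.level);
    }
    for (int lvl = max_level; lvl >= min_odd; lvl--) {
        for (size_t i = 0; i < runs.size(); ) {
            if (runs[i].level < lvl) {
                i++;
                continue;
            }
            size_t j = i; // find the maximal sequence of runs at level >= lvl
            while (j < runs.size() && runs[j].level >= lvl)
                j++;
            std::reverse(runs.begin() + i, runs.begin() + j);
            i = j;
        }
    }
    return runs;
}
```

With words laid out in logical order, running this once per line on the per-word levels gives the visual order.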

Our current HarfBuzz implementation ("best") is a bit buggy with text more complex than just western with ligatures. I thought it rendered RTL words correctly, but even that is not done well: the measurements are all messed up (we don't process clusters correctly if there's some decomposed Unicode), and the way we use the fallback font (no HarfBuzz re-shaping, using the main codepoint for all chars that are part of the cluster) produces wrong results.

And there are cases where the bidi algo doesn't say anything, like the reordering of soft hyphens (and so, should we hyphenate LTR in bidi text, where the hyphen may end up in the middle of the line? :)
http://unicode.org/pipermail/unicode/2014-April/thread.html#353 Bidi reordering of soft hyphen

http://www.staroceans.org/myprojects/vlc/modules/text_renderer/freetype/text_layout.c one of the rare examples of the use of fribidi_reorder_line, which I guess we'll have to use too.

One interesting solution to re-shaping with fallback fonts is how it was done in Chrome: https://lists.freedesktop.org/archives/harfbuzz/2015-October/005168.html font fallback in Chrome https://chromium.googlesource.com/chromium/src/+/9f6a2b03ccb7091804f173b70b5facff7dffbd61%5E%21/#F8 chrome improved shaping See also minikin Layout.cpp code below.

We may also need freetype rebuilt against harfbuzz.

libraqm

https://github.com/HOST-Oman/libraqm http://gtk.10911.n7.nabble.com/pango-vs-libraqm-td94839.html raqm does not do font fallback and line breaking currently, nor does it do font enumeration. Raqm is designed to add to applications that otherwise have a very simplistic view of text rendering. Ie. they use FreeType and a single font to render single-line text (think, movie subtitles...).

pango

https://developer.gnome.org/pango/stable/ https://github.com/GNOME/pango pango
https://gist.github.com/bert/262331/ sample usage
Pango and libraqm provide higher level functions. They do the full pipeline (Unicode preprocessing, shaping, bidi, line breaking, rendering). But we can't use their high level functions because they don't do as much as crengine (vertical text alignment, inline images, floats), so if we were to use them, we'd need to provide small segments, and we may as well do that with the lower level libraries. Or skip all the crengine services (font management, text drawing) and use Pango instead, and then have to re-implement all the crengine higher level functions that Pango does not provide. Not my plan :) Pango also has dependencies on glib and fontconfig, which does not look like fun.

The most interesting stuff in Pango is in https://github.com/GNOME/pango/blob/master/pango/break.c, where it implements UAX#14 and UAX#29, like libunibreak, but in one single pass, with some additional tweaks for Arabic and Indic scripts (dunno if libunibreak does that as well or not). Also, in pango-layout.c justify_words(): for justification, it does as crengine does: it expands spaces. And if there is not a single one, it switches to adjusting letter spacing (which crengine does not do).
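A toy model of that two-step justification (all names and units made up, this is neither Pango's nor crengine's API): distribute the leftover width over the line's spaces if there are any, else over the inter-letter gaps:

```cpp
#include <cassert>
#include <vector>

// Spread `extra` units of leftover line width: prefer the line's spaces;
// if the line has no space at all, fall back to the inter-letter gaps.
// Returns per-gap increments; the remainder goes to the leftmost gaps.
static std::vector<int> justify_gaps(int extra, int n_spaces, int n_letter_gaps) {
    int n = n_spaces > 0 ? n_spaces : n_letter_gaps;
    std::vector<int> add(n, 0);
    if (n == 0)
        return add; // nothing stretchable on this line
    for (int k = 0; k < n; k++)
        add[k] = extra / n + (k < extra % n ? 1 : 0);
    return add;
}
```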

Others developments/discussions

https://raphlinus.github.io/rust/skribo/text/2019/02/27/text-layout-kickoff.html work towards a rust library   https://gitlab.redox-os.org/redox-os/rusttype/issues/2

Text rendering/Font fallback in Chrome and other browsers https://chromium.googlesource.com/chromium/src/+/master/third_party/blink/renderer/platform/fonts/README.md https://gist.github.com/CrendKing/c162f5a16507d2163d58ee0cf542e695

minikin is the library used in Android for text layout with HarfBuzz. It's quite tough to find a master authoritative version, because there are many divergent ones... (and the latest Android one does not include some changes provided by the HarfBuzz author, which are available in some other branches or forks). Here are a few links (the interesting file is Layout.cpp):
https://android.googlesource.com/platform/frameworks/minikin/ minikin main repo
https://dl.khadas.com/test/github/frameworks/minikin/libs/minikin/Layout.cpp
https://github.com/abarth/minikin
https://github.com/flutter/engine/blob/master/third_party/txt/src/minikin/Layout.cpp
https://source.codeaurora.org/quic/la/platform/frameworks/minikin with changes from the HarfBuzz author
https://github.com/CyanogenMod/android_frameworks_minikin/blob/cm-12.0/libs/minikin/Layout.cpp with changes from the HarfBuzz author
https://medium.com/mindorks/deep-dive-in-android-text-and-best-practices-part-1-6385b28eeb94 minikin (Android text layout)

CJK (horizontal) layout

@frankyifei said in https://github.com/koreader/koreader/issues/2844#issuecomment-464493642:

I would say the original design of crengine did not consider CJK layout. So the code here just makes it work for CJK. The code tries to squeeze the space between characters and it is not the best way for CJK. There are better ways like squeezing punctuation marks to make the lines look even. But it is very hard to implement on current code. Traditional Chinese does not have punctuation, and characters should fill every line with identical spaces. now the language is quite westernized so many people are get used to other layout similar to European languages. Without rewriting lvtextform, it is very difficult to change things and add features. Is it possible to use pango here?

It looks like there is nothing special for CJK in Pango. If a paragraph is pure CJK, it would do as crengine does, spacing each CJK char, and it would result in a nice regular ideograph grid. If there is a single Latin letter or punctuation or normal space, the grid would be broken (like it is in the Wikipedia ZH I use for testing CJK). It's up to us to do the right thing in lvtextfm.cpp. A few ideas:

Anyway, Pango looks like it does none of that.

A question: with a pure CJK ideograph paragraph, and some line ending with two (or three) "CJK right punctuation" marks, all extending past the available width, how should that be dealt with? If there were only one, it could be made to hang in the right margin, and the grid would be fine. But with two? Would both hang in the right margin? Or would they be pushed to the next line, along with the preceding regular char, making a hole in the grid at the far right of the previous line (or breaking the grid if we justify that line, as crengine would do, it seems)? What's the proper way to handle that?

Note: there may be some stuff to fix in crengine to also consider UNICODE_NO_BREAK_SPACE when expanding/shrinking spaces for justification (Pango has, in break.c: attrs[i].is_expandable_space = (0x0020 == wc || 0x00A0 == wc);).
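For clarity, the Pango test quoted above amounts to this small predicate:

```cpp
#include <cassert>
#include <cstdint>

// Both U+0020 SPACE and U+00A0 NO-BREAK SPACE count as expandable when
// justifying (a no-break space forbids a break, but still stretches).
static bool is_expandable_space(uint32_t wc) {
    return wc == 0x0020 || wc == 0x00A0;
}
```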

Vertical text layout

Low interest, because it looks so much more complicated. Pinging @xelxebar, who showed interest in all that, about vertical text. Just some questions, because I have no idea how it should work. I guess it's the whole block element that makes a vertical text section. What's the effect of a <BR>? Go back to the top of the next vertical line? What should happen when there are more <BR> than the max number of vertical lines that can fit in the available width? I naively thought vertical text could be easily sized:

How is that supposed to work for long paragraphs that span multiple pages? Should scroll mode be aware of how page mode has cut the blocks, or can it lay out text on some possibly infinite vertical length? How do browsers (which don't have pages) do it?


If having a go at it, possibly the first thing to fix is HarfBuzz rendering of embedded RTL, which means implementing full bidi support... Should this possibly expensive new stuff be used when selecting kerning mode "best" (which is the only one where we use HarfBuzz correctly, so a prerequisite), or would we need a "bestest" switch which, in addition to HarfBuzz, would trigger the use of the probably expensive bidi processing? Or some additional gTextRenderingFlag to enable or not the use of any of the new features (like done for enhanced block rendering)? I fear starting all that because of the spaghetti mess it will be, with so many #ifdef USE_FRIBIDI / #ifdef USE_LIBUNIBREAK, if we want crengine to still be able to compile and work without all of these... Or a single #ifdef USE_ENHANCED_TEXT_LIBRARIES (which should include USE_HARFBUZZ?)

poire-z commented 5 years ago

Pinging some KOReader contributors whose PRs show they must be at ease with some of these languages I'm not:

@frankyifei @houqp @chrox who did some CJK work on crengine, mainly aiming at Chinese I guess
@limerainne, who added a Korean keyboard https://github.com/koreader/koreader/pull/5053
@alexandrahurst who added a basic Japanese keyboard https://github.com/koreader/koreader/pull/2930 and posted an issue about font variants on the frontend UI side https://github.com/koreader/koreader/issues/2936
@rtega who posted an issue about Japanese word selection https://github.com/koreader/koreader/issues/4091
@xelxebar "Japanese epub have given me enough grief" https://github.com/koreader/koreader/issues/4353
@gerroon who posted quite a few issues, is in Japan and reads Persian PDFs https://github.com/koreader/koreader/issues/1767

Do you read EPUBs in these languages with KOReader? Is the experience good enough, or not? What's missing, and what looks like an easy fix? Have I broken some stuff (line breaking, hanging punctuation) that used to work better, with my changes over the last 2 years? Any answers to my questions on the CJK grid and Korean line breaking above (search for these words if you don't want to read my long text :)?

robert00s commented 5 years ago

Polish and Czech typography rules require avoiding one-letter prepositions at line endings. So they should be connected to the following words by a non-breaking space.

Incorrect (Tom went to a shop in the city to buy apples and rolls.):

Tomek poszedł do sklepu w  <-- not allowed
mieście, żeby kupić jabłka i   <-- not allowed
bułki

Correct:

Tomek poszedł do sklepu   
w mieście, żeby kupić jabłka
i bułki.

Many publishers (but not all) add &nbsp; to the whole document.

<p>Tomek poszedł do sklepu w&nbsp;mieście, żeby kupić jabłka i&nbsp;bułki.</p>

NiLuJe commented 5 years ago

I don't have much to say on the CJK/RTL/BIDI front, except to confirm that, yeah, we probably can't switch to pango, for pretty much the reasons you've explained. (Plus, it's possibly going to be rejigged as a thinner wrapper around harfbuzz, which sounds like a great idea, but is still far ahead on the horizon ;)).

@shermp introduced libunibreak in FBInk for line-breaking purposes, so I have moderate experience with it. Basically, it parses a UTF string, and for each byte* index in that string, it fills another buffer with one of a few different flags (like can't break, must break, can break, ...). What you do with that information is left to you ;).

EDIT: That implies having a pretty fast & rock-solid utf-*/Unicode encoder/decoder, and a pretty solid utf-8 iterator. We've got a pretty great utf-8 decoder in FBInk, and a decent one-way iterator (next codepoint) based on it, but a pretty shitty reverse one (prev codepoint), which has led to some pretty gnarly workarounds.
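For what it's worth, a reverse step over *well-formed* UTF-8 only needs to skip continuation bytes; a minimal sketch (no validation, so malformed input is where the gnarly part starts):

```cpp
#include <cassert>
#include <cstddef>

// Minimal "prev codepoint" step over well-formed UTF-8: starting from a byte
// index just past a codepoint, back up over continuation bytes (10xxxxxx)
// until the lead byte is reached. Assumes pos > 0 and valid UTF-8.
static size_t utf8_prev(const unsigned char *s, size_t pos) {
    do {
        pos--;
    } while (pos > 0 && (s[pos] & 0xC0) == 0x80);
    return pos;
}
```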

NiLuJe commented 5 years ago

@robert00s example raises a good point: I doubt UAX line-breaking rules deal with those kinds of grammatical/typographical conventions, do they?

Frenzie commented 5 years ago

UAX line-breaking rules don't include a syntactic parser, nor do I think they should. ;-) Whether it's best to attach a single character like that to the previous word or the next word is something you can't really say otherwise.

For example, in a Dutch sentence like "ik ga na" it'd be preferable to keep "ga na" together, but unless you know that it is a conjugated form of the verb "nagaan" (to check up) as opposed to "gaan" (to go) there's no way to tell. Those words could also occur in a different context, e.g., "ik ga na het concert naar huis" (I go after the concert to home). In that case there'd be no preference to keep "ga na" together, possibly the opposite to prevent a "check up" reading.

poire-z commented 5 years ago

I doubt UAX line-breaking rules deal with those kinds of grammatical/typographical conventions, do they

I was hoping it would :) so we wouldn't have to bother with all these peculiarities... But it probably doesn't... I'm still hoping it does that correctly for Japanese and Korean (where I understand some combinations of standalone glyphs make a syllable and should not be broken) and that we can trust it for finding unbreakable points in a string of CJK chars.

Polish and Czech typography rules requires avoiding one letter prepositions at line endings. [...] Many publishers (but not all) add   to the whole document

In French, it's also expected from publishers with quotation marks, where there is a space after the opening one and a space before the closing one. I added some code in #237 to handle that even if the publisher forgot to put &nbsp;. But that could be wrong for German... I really don't want to have to find out or guess the language of some bit of text, so maybe we could just have a few sets of flags (togglable from the UI side) to do these kinds of tweaks, if hopefully we can categorize them into a small set of flags (like I think we'll possibly have to patch https://github.com/adah1972/libunibreak/blob/master/src/linebreakdef.c to get some additional combos, possibly one for German, and one that would work for all other western and CJK languages, if there's no conflict). So, we could possibly have a flag to "avoid single letter word at end of line" - or do that by default if it works in all languages (it may not change line spacing much with justification, as it's just pushing 2 chars onto the next line). And my French spacing stuff could also be a flag.
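A rough sketch of what such a "glue one-letter words" tweak could look like (hypothetical function, ASCII-only letter test; real Polish/Czech rules are more subtle, as only prepositions/conjunctions qualify, while this triggers on any single letter):

```cpp
#include <cassert>
#include <string>

// ASCII-only letter test, good enough for this sketch.
static bool is_ascii_letter(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
}

// Replace the space *after* a single-letter word with U+00A0 NO-BREAK SPACE,
// so one-letter words like "w", "i", "o", "z" never end a line.
static std::u32string glue_single_letter_words(const std::u32string &s) {
    std::u32string out;
    for (size_t i = 0; i < s.size(); i++) {
        // The preceding word is a single letter if the char before this space
        // is a letter and the char before *that* is a space (or line start).
        bool word_is_single_letter =
            i >= 1 && is_ascii_letter(s[i-1]) && (i < 2 || s[i-2] == U' ');
        if (s[i] == U' ' && word_is_single_letter
            && i + 1 < s.size() && is_ascii_letter(s[i+1])) {
            out += char32_t(0x00A0); // NO-BREAK SPACE
        } else {
            out += s[i];
        }
    }
    return out;
}
```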

Frenzie commented 5 years ago

I don't think my German books tend to have spaces between words and quotation marks like the guillemets in French? (Not that it matters in that case.)

poire-z commented 5 years ago

https://german.stackexchange.com/questions/117/what-is-the-correct-way-to-denote-a-quotation-in-german I figure that if German uses »quoted stuff«, in a sentence like he said »this« and went away, my French rules would prevent line breaking after said and before and. But maybe » « are uncommon enough, and only „ “ are used, which we don't see in French and don't need tweaking.

Frenzie commented 5 years ago

The fourth book I pulled off the shelf used » «, so based on that sampling one in four. :-P

[photo of the book page]

poire-z commented 5 years ago

libunibreak in FBInk for line-breaking purposes [...] What you do with that information is left to you ;).

libunibreak is probably the easiest stuff to integrate. I did some quick testing, and it would just come down to adding/replacing this in copyText():

#if (USE_LIBUNIBREAK==1)
const char * lang = "fr"; // language hardcoded for this quick test
if (!init_break_context_done) {
    lb_init_break_context(&lbCtx, m_text[pos], lang);
    init_break_context_done = true;
}
else {
    int brk = lb_process_next_char(&lbCtx, (utf32_t)m_text[pos]);
    // printf("between <%c%c>: brk %d\n", m_text[pos-1], m_text[pos], brk);
    if (brk == LINEBREAK_ALLOWBREAK) {
        m_flags[pos-1] |= LCHAR_ALLOW_WRAP_AFTER;
    }
    else {
        m_flags[pos-1] &= ~LCHAR_ALLOW_WRAP_AFTER;
    }
}
#endif
pos++;

and later:

+                #if (USE_LIBUNIBREAK==1)
+                if (flags & LCHAR_ALLOW_WRAP_AFTER) {
+                    lastNormalWrap = i;
+                }
+                #else
                 if ((flags & LCHAR_ALLOW_WRAP_AFTER) || isCJKIdeograph(m_text[i])) {
                     // Need to check if previous and next non-space char request a wrap on
                     // this space (or CJK char) to be avoided

It exports its low level API, so we can feed it char by char without needing to allocate another long buffer to get the results. Unlike fribidi, which seems to want the full text buffer, and fills a flags buffer just as long :(

I'm a bit surprised that it works this linearly, without any need to go back and correct decisions made 2 or 3 chars earlier - which maybe means this UAX#14 algo is just some basic one the Unicode people felt they had to provide, and is just not good enough :)

That implies having a pretty fast & rock-solid utf-*/unicode encoder/decoder, and a pretty solid utf-8 iterator

At this point of text layout, crengine already has and works with Unicode codepoints, which is nice and is what most of these libraries already prefer. (The utf8 decoding is done at HTML parsing time, and if it's not fast or solid enough, that's for another topic :)

robert00s commented 5 years ago

@poire-z

So, we could possibly have a flag to "avoid single letter word at end of line" - or do that by default if that works in all languages (it may not change much line spacing with justificiation, it's just pushing 2 chars on the next line).

Of course. This should be an option (disabled by default - almost all languages don't have this rule).

poire-z commented 5 years ago

Another note related to the use of fonts:

I hope we can stay with the 2 crengine fonts, to not add another layer of complexity :)

Personally, I've made myself a somewhat complete fallback font by merging (with FontForge) a few of those we provide, because each individually has holes...:

/* Run as : $ /C/Program\ Files\ \(x86\)/FontForge/bin/fontforge.exe -script */
freesans = "FreeSans.ttf"
freeserif = "FreeSerif.ttf"
notosanscjk = "NotoSansCJK-Regular.ttf"
newfont = "FreeSans-extended.ttf"
tmpfont = "FreeSans-tmp.ttf"
Open(freesans)
/* merge not found glyphs from FreeSerif */
MergeFonts(freeserif)
/* segfault if not in-between save */
Generate(tmpfont, "", 4/* remove bitmap */)
Open(tmpfont)
/* remove symbols (better/bigger in NotoSansCJK) */
Select(0u2500, 0u27FF)
DetachAndRemoveGlyphs()
/* merge not found glyphs from NotoSansCJK */
MergeFonts(notosanscjk)
SetFontNames("FreeSansExtended", "FreeSans extended", "FreeSans extended", "Book")
Generate(newfont, "", 4/* remove bitmap */)
Close()

NotoSansCJK has/had wrong glyphs for Greek and Arabic, so I couldn't even read Greek - FreeSans and FreeSerif have good Greek and Hebrew, but no CJK - and my preferred font only has Latin. So I couldn't get both CJK and Hebrew shown in the same book.

Using my preferred (for looks) Latin font, and this fallback font, I rarely see ? glyphs. I guess most people would be happy that way: their default font for a good look of their main script, and a fallback, possibly not as nice looking, that just shows all the glyphs.

Would we be allowed, and would it make sense, to provide such a fallbackenstein font with KOReader? Or is there still too much user preference to make some decisions (like my preferring to start with FreeSans, which is morphologically nearer to my preferred font than the thin and small but nice FreeSerif)? (not willing to undertake that font building, just asking :) (Or would we need to support a 2nd fallback font in crengine ? :| )

poire-z commented 5 years ago

Also pinging @virxkane, who brought HarfBuzz support into CoolReader, which we then picked up - for info and advice, to make sure we do things right for Russian too :)

shermp commented 5 years ago

As @NiLuJe stated, I introduced libunibreak to FBInk when I was implementing TrueType/OpenType rendering for it. It was chosen because I wanted better line breaking than what FBInk had at the time, and it was really easy to use.

As far as paragraph justification goes, some sort of best/total fit algorithm such as Knuth & Plass (as used in TeX) would be nice to have. The CSS working group actually had some discussions about this at https://github.com/w3c/csswg-drafts/issues/672; one idea floated was using an n-line sliding window to improve line breaking.
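
The core of such a total-fit approach can be sketched as dynamic programming over break points, minimizing the sum of squared leftover space per line. This is a drastically simplified toy version of the Knuth & Plass idea - fixed-width chars, no stretch/shrink classes, penalties or hyphenation:

```cpp
#include <cassert>
#include <climits>
#include <sstream>
#include <string>
#include <vector>

// Minimal "total fit" line breaking: choose break points that minimize
// the sum over all lines of (leftover space)^2, instead of greedily
// filling each line. A toy version of the Knuth & Plass idea; real
// implementations add stretch/shrink, penalties and hyphenation.
static std::vector<std::string> split_words(const std::string &text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}

// Returns optimal break indices: line k holds words[breaks[k]..breaks[k+1]).
static std::vector<size_t> total_fit(const std::vector<std::string> &words,
                                     size_t width) {
    size_t n = words.size();
    const long INF = LONG_MAX / 4;
    std::vector<long> best(n + 1, INF);   // best[i] = min cost to lay out words[0..i)
    std::vector<size_t> prev(n + 1, 0);   // prev[i] = start of the line ending at i
    best[0] = 0;
    for (size_t i = 1; i <= n; i++) {
        size_t len = 0;
        for (size_t j = i; j > 0; j--) {             // candidate line words[j-1..i)
            len += words[j-1].size() + (len ? 1 : 0); // word + separating space
            if (len > width) break;
            long slack = (long)(width - len);
            long cost = (i == n) ? 0 : slack * slack; // last line is free
            if (best[j-1] + cost < best[i]) {
                best[i] = best[j-1] + cost;
                prev[i] = j - 1;
            }
        }
    }
    std::vector<size_t> breaks;
    for (size_t i = n; i > 0; i = prev[i])
        breaks.insert(breaks.begin(), prev[i]);
    breaks.push_back(n);
    return breaks;
}
```

Unlike first-fit, this will e.g. pull a word back from an earlier line if doing so evens out the slack over the whole paragraph - which is exactly why it can't be done strictly linearly.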

If you wanted to get really really fancy, one could dive down the rabbit hole that is the Microtype system implemented in pdfTeX and luaTeX, as documented here and here. (Ok, not really being serious here, that would be a LOT of work, if even possible...)

poire-z commented 5 years ago

some sort of best/total fit algorithm such as the Knuth & Plass would be nice to have

Well, crengine indeed doesn't do anything complicated: it works linearly, char by char, never going back. It puts words on a line (accounting for the width used by spaces, for possible shrink or expand adjustments) until one no longer fits. It then tries to hyphenate the word that did not fit, to possibly find a part of it that would fit. It might be very complicated to do some n-line sliding correctly now that we support floats :| Also, crengine needs to initially render the full document to compute the height of all blocks/paragraphs, so anything more complicated and costly would increase that load time (which is already too long...)
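
That linear behaviour can be sketched like so - a toy with fixed-width chars, where the hypothetical try_hyphenate helper stands in for crengine's real pattern-based hyphenation:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy first-fit line filler in the spirit described above: append words
// until one no longer fits, then try to split (hyphenate) that word.
// try_hyphenate is a stand-in for real pattern-based hyphenation: here
// it just splits at the midpoint if the hyphenated head fits the room.
static bool try_hyphenate(const std::string &w, size_t room,
                          std::string &head, std::string &tail) {
    if (w.size() < 4) return false;
    size_t cut = w.size() / 2;
    if (cut + 1 > room) return false;        // head + '-' must fit the room left
    head = w.substr(0, cut) + "-";
    tail = w.substr(cut);
    return true;
}

static std::vector<std::string> first_fit(const std::vector<std::string> &words,
                                          size_t width) {
    std::vector<std::string> lines;
    std::string cur;
    for (size_t i = 0; i < words.size(); i++) {
        const std::string &w = words[i];
        size_t sep = cur.empty() ? 0 : 1;    // space before the word, if any
        if (cur.size() + sep + w.size() <= width) {
            if (sep) cur += ' ';
            cur += w;
            continue;
        }
        // Word doesn't fit: try to hyphenate into the remaining room.
        long room = (long)width - (long)cur.size() - (long)sep;
        std::string head, tail;
        if (room >= 3 && try_hyphenate(w, (size_t)room, head, tail)) {
            if (sep) cur += ' ';
            cur += head;
            lines.push_back(cur);
            cur = tail;          // rest of the word starts the next line
        } else {
            if (!cur.empty()) lines.push_back(cur);
            cur = w;             // an overlong word just becomes an overlong line
        }
    }
    if (!cur.empty()) lines.push_back(cur);
    return lines;
}
```

Note that each decision is final once made - there is no revisiting earlier lines, which is what keeps the cost linear in the paragraph length.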

Anyway, I'm not really an aesthete regarding text layout and fonts, but I'm rarely stopped while reading thinking wow, that's really ugly because there are too many consecutive hyphens or too large or narrow spacing (maybe once every 100 pages :) I'm more often stopped by rivers.

So, as far as latin text is concerned, I find what we have quite ok. What do others think?

I'm more inclined for now to fix the occasional embedded RTL words or sentences that currently mess up the surrounding Latin text, and possibly to have proper western optical margin alignment, while making all that work for CJK and RTL too.
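
The hanging punctuation part of optical margin alignment could look roughly like this toy sketch, with 1-unit-wide chars and trailing punctuation allowed to hang fully into the right margin (real optical margin alignment would use per-glyph fractional hang amounts, and the character set here is only illustrative):

```cpp
#include <cassert>
#include <string>

// Sketch of right-side hanging punctuation: when deciding whether a
// chunk still fits at the end of the line, a trailing comma/period/
// quote is not counted, so the glyph "hangs" into the right margin.
// Full-width hang and this character set are illustration choices.
static bool hangs_right(char c) {
    return c == ',' || c == '.' || c == '\'' || c == '"';
}

// Effective width of `chunk` when placed at end of line, each char
// being 1 unit wide.
static int effective_width(const std::string &chunk) {
    int w = (int)chunk.size();
    if (!chunk.empty() && hangs_right(chunk[chunk.size() - 1]))
        w -= 1;   // trailing punctuation hangs fully into the margin
    return w;
}

// `used` units are already occupied on a line `width` units wide.
static bool fits_at_line_end(const std::string &chunk, int used, int width) {
    return used + effective_width(chunk) <= width;
}
```

The effect is that a line ending in "word." can be one glyph "too long" and still be accepted, which keeps the right margin optically straight.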

shermp commented 5 years ago

Fair call on the line breaking. I know from experience with FBInk that first-fit is difficult enough to get right, and I wasn't even trying to do justification or hyphenation. And yeah, floats would throw a spanner in the works, wouldn't they? Although I would have thought (and I could be totally wrong here) that the only floats that would cause concern would be those that protrude into the previous paragraph/block.

I have to admit though that after reading so long using RMSDK, there's always been something slightly off (to me) about how both the Kepub renderer and crengine (among others) do line breaking. I have no idea what Adobe did, but their algorithm is (was) probably the best line breaking I've seen outside of typesetting software. It's a shame Adobe basically abandoned their renderer :(

Frenzie commented 5 years ago

the only floats that would cause concern would be those that protrude into the previous paragraph/block.

With negative margins you mean? Otherwise floats don't really do that.

shermp commented 5 years ago

the only floats that would cause concern would be those that protrude into the previous paragraph/block.

With negative margins you mean? Otherwise floats don't really do that.

Yeah. And that's what I thought.

Which is why I would think that so long as the algorithm can deal with differing line widths, floats shouldn't make that much of a difference for a multi-line algorithm. For floats that can be pre-positioned (like dropcaps), place them first before rendering text. For mid-paragraph floats, you could probably reset the line-breaking algorithm at the line the float starts.

And of course, I could be talking out my arse, as I am probably completely wrong, so please feel free to ignore me. I'm more having a bit of a thought exercise at this point.

poire-z commented 5 years ago

For mid-paragraph floats, you could probably reset the line-breaking algorithm at the line the float starts.

Except that when such a mid-paragraph (embedded) float is met, it may fit on the line, but if it can't, it has to be delayed until the next line. And if you have a complex algo, it may, after passing the float, decide to shorten the text - and oops, the float could have fit after all :) Also, our lines can have various font sizes and vertical-align values, so depending on what you still bring onto that line or keep for the next, its height can change, and so would the float positioning (depending on possible other previous floats impacting its position)... And then there's bringing CJK and RTL into the mix...

Do the other well-known line breaking algorithms support bidi text? Or are they supposed to work as-is with pure RTL text?

edit: But crengine has already detected whether floats are present before laying out lines - so this could allow having some enhanced line layout algo, switching back to the current linear algo only when floats are present (as 99.9999% of the world's paragraphs don't have any :)

shermp commented 5 years ago

Except that when such a mid-paragraph (embedded) float is met, it may fit on the line, but if it can't, it has to be delayed until the next line. And if you have a complex algo, it may, after passing the float, decide to shorten the text - and oops, the float could have fit after all :) Also, our lines can have various font sizes and vertical-align values, so depending on what you still bring onto that line or keep for the next, its height can change, and so would the float positioning (depending on possible other previous floats impacting its position)... And then there's bringing CJK and RTL into the mix...

Do the other well-known line breaking algorithms support bidi text? Or are they supposed to work as-is with pure RTL text?

I guess one potential option is to fall back to first-fit if a block of text contains an embedded float. But yeah. It's hard. I don't blame you at all if you want to stick with first-fit.

As to breaking bidi text, I have absolutely no idea how that's supposed to be handled. UAX 14 appears to have a single paragraph on the matter:

In bidirectional text, line breaks are determined before applying rule L1 of the Unicode Bidirectional Algorithm [Bidi]. However, line breaking is strictly independent of directional properties of the characters or of any auxiliary information determined by the application of rules of that algorithm.

frankyifei commented 5 years ago

It is very difficult to follow all the CJK layout rules. For example, see here: Chinese Compression Rules for Punctuation Marks. The practical way is to use the default behaviour of Pango and implement some rules later if necessary. If the default result is close to what Chrome and Firefox give, that would be good enough. For line breaking, the current code for CJK is working well, although it looks complex with lots of ifs. I think it was added by @houqp ? It implements the Prohibition Rules for Line Start and Line End. IMO it is the most obvious rule for Chinese readers.
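
Those prohibition rules boil down to checks like the following sketch over UTF-32 codepoints, with only a tiny illustrative subset of the real forbidden character sets:

```cpp
#include <cassert>
#include <string>

// Sketch of CJK line start/end prohibition checks: closing punctuation
// must not begin a line, opening punctuation must not end one. The
// character sets here are a tiny, illustrative subset of the full rules.
static bool forbidden_at_line_start(char32_t c) {
    return c == U'、' || c == U'。' || c == U'」' || c == U'）'
        || c == U'！' || c == U'？';
}

static bool forbidden_at_line_end(char32_t c) {
    return c == U'「' || c == U'（';
}

// May we break the text so that the next line starts at position pos?
static bool break_allowed(const std::u32string &text, size_t pos) {
    if (pos == 0 || pos >= text.size()) return false;
    if (forbidden_at_line_start(text[pos])) return false;
    if (forbidden_at_line_end(text[pos - 1])) return false;
    return true;
}
```

In a UAX#14/libunibreak world, the same effect comes from those characters' line-breaking classes (CL, OP, etc.) rather than from explicit lists like this.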

poire-z commented 5 years ago

Thanks, added your link (and a few others) to the first post. (With your link, I spent a few minutes looking for some occurrence of the issue - described in the English text - in the following Chinese texts, with no luck, before realizing those Chinese texts were not samples/illustrations, but the Chinese translations of the preceding English text... :) Too bad there's no visual sample of bad/good rendering.)

Can you please answer my questions about the importance (or not) of the (invisible) grid (the wish for perfect vertical alignment of the ideographs) in Chinese text, which can get messed up (and is, in crengine) when there are some non-fixed-width Latin chars in the line, or when the last char is some left punctuation that needs to be pushed onto the next line (making a hole the size of one or two ideographs at the end of the line)? And whether justification is more important than this grid? (Your link talks about space or punctuation compression, so I guess it doesn't care about that grid...) (I insist because these keep haunting my thoughts :) just want to know how that should be approached.)

edit: OK, it's mentioned in your link https://www.w3.org/TR/clreq/#handling_of_grid_alignment_in_chinese_and_western_mixed_text_composition

Due to the fact that each Han character is of the same width, not only should characters at the start and end of a line be aligned but it is also a requirement for characters within blocks of Han text to be aligned both vertically and horizontally, whether in vertical or horizontal writing mode. When Western text or European numerals are present, this principle is harder to achieve

Grid alignment is adopted more often in Traditional Chinese typesetting, whereas use in Simplified Chinese is rare.

But I'd still like your subjective opinion (because we currently don't have that, and you seem to be fine without it). I still feel we could have it with a few tricks.


About that specific code with the ifs, I expect UAX#14 / libunibreak should set the proper allow/avoid break flags for these CJK punctuations, so we may get a more correct implementation and can avoid a bit of that code (or we may have to keep some of it if libunibreak is bad at that).

poire-z commented 5 years ago

@Frenzie @NiLuJe: any specific thoughts on my take on our fonts https://github.com/koreader/crengine/issues/307#issuecomment-523972046 ?

Another thing with crengine font handling is that a tag with style="font-family: A, B, C, serif" will use font A (if found), and will pick fallback glyphs from our main fallback font instead of first trying to find them in B and C, as it should, I think. An option would be to have each crengine font object carry a pointer to its next fallback, so these could be chained somehow.
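
The chaining idea could be sketched like this - ToyFont is a hypothetical stand-in for crengine's font objects, with glyph coverage reduced to a set of codepoints:

```cpp
#include <cassert>
#include <set>
#include <string>

// Sketch of the "each font carries a pointer to its next fallback"
// idea: glyph lookup walks the chain A -> B -> C -> main fallback,
// instead of jumping straight from A to the main fallback.
// ToyFont stands in for crengine's font objects.
struct ToyFont {
    std::string name;
    std::set<char32_t> coverage;   // codepoints this font can draw
    const ToyFont *next_fallback;  // next font in the chain, or nullptr

    bool has_glyph(char32_t c) const { return coverage.count(c) != 0; }
};

// Returns the first font in the chain that covers c, or nullptr
// (the caller would then draw a '?' replacement glyph).
static const ToyFont *font_for_glyph(const ToyFont *font, char32_t c) {
    for (const ToyFont *f = font; f; f = f->next_fallback)
        if (f->has_glyph(c)) return f;
    return nullptr;
}
```

With such a chain, a style of "A, B, C, serif" would resolve missing glyphs through B and C before ever reaching the main fallback.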

Frenzie commented 5 years ago

will pick fallback glyphs from our main fallback font, instead of trying to find them in B and C, as it should I think.

It makes intuitive sense, but unless something changed in the past decade browsers don't do it that way. (Which needn't mean it's wrong, you'd have to check the spec for that, but it does mean there's no expectation for it.)

NiLuJe commented 5 years ago

@poire-z : Yeah, I wouldn't complexify the fallback mechanism. Unless someone one day decides to say "fuck it all" and switches to fontconfig ;p.

poire-z commented 4 years ago

So, bidi text support has been done with #309. There are still a few things to be done at the block rendering level, like list bullets, table column order, and global document direction - and/or some specific CSS support like text-align: start instead of left. I'm delaying the work on that, because I'm not quite sure yet how to go at it (CSS only - or handled by our FlowState object, saved in RenderRectAccessor - or grabbed from parents when needed...) Mostly because if I go at handling/storing dir=, I may as well handle/store lang= similarly, and that depends on what we could do with lang=.

And I've been contemplating the following idea: instead of having global typography tunables in the frontend (like: [ ] No break on space before punctuation (for French) [ ] No standalone single char at end of line (for Polish)), we could just have a set of typography rules & features per language, and crengine could request them from the frontend when it meets a new lang= (from the frontend, because the parsing/decision/building is a lot easier to do in Lua and would allow easier customization by users than if it were done on the crengine side).

So, we could have/generate some settings per language tag. We'd get as input the various forms a lang tag can have: fr, fr-CA, zh-Latn, tr-Arab, and from that, we could guess/generate and return (converted to a C struct ready to be used by crengine):

{
    libunibreak_lang_tag = "zh",
    libunibreak_custom_rules = { { 0x2018, 0x2018, LBP_CL } },
    avoid_wrap_around = { "'", "/" },
    para_default_direction = "rtl",
    harfbuzz_lang = "zh",              -- for font variation
    kerning_method_force = "harfbuzz", -- for Arabic: use harfbuzz/best even if not selected
    hyphenation = "French_Canada.pattern",
    fallback_font = "NotoSansMyanmar-Regular.ttf",
    fallback_font_url = "https://google.com/download/NotoSansMyanmar-Regular.ttf", -- to propose download
    line_min_orphan_chars_on_the_right = 2,
    ...
}

that crengine could store/save/hash, associating that hash with each paragraph so these settings are used when laying out that paragraph, without crengine having to make any decision about the language/chars: it would just follow the rules given. It would request the default typography settings for the root node, and we could allow/forbid overriding it even if some lang= exists below in the document (like !important). On the frontend, we could have a Typography > menu with languages (a bit like we have for Hyphenation: the ability to set one as the default or the fallback). If set as default, we would always return the settings for that language, no matter what input crengine gives. If set as fallback, we would use the crengine input; if none, the book language from metadata; if none, that fallback; and if none, the UI language.
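
The lookup itself could be a simple longest-prefix match over the lang tag, falling back to a default. A sketch - TypoSettings and the table contents are illustrative stand-ins, not an existing API:

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch of the per-language settings lookup described above: try the
// full lang tag ("fr-CA"), then progressively strip subtags ("fr"),
// then fall back to a default. TypoSettings is a stand-in for the
// C struct of typography rules the frontend would hand to crengine.
struct TypoSettings {
    std::string hyphenation;   // e.g. a hyphenation pattern file name
};

static TypoSettings resolve_typography(const std::map<std::string, TypoSettings> &table,
                                       std::string tag,
                                       const TypoSettings &fallback) {
    while (!tag.empty()) {
        std::map<std::string, TypoSettings>::const_iterator it = table.find(tag);
        if (it != table.end())
            return it->second;
        size_t dash = tag.rfind('-');
        if (dash == std::string::npos)
            break;
        tag.resize(dash);          // "fr-CA" -> "fr"
    }
    return fallback;
}
```

This way "fr-CA" gets its Canadian-French pattern, plain "fr-FR" degrades to the generic "fr" entry, and an unknown language gets whatever default/fallback the user selected.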

Another quicker or uglier option would be to just use CSS, and have in our epub.css or in some style tweaks:

*[lang^="fr"]    { -cr-typography: hyphenation(French.pattern), avoid_wrap_around(') }
*[lang^="fr-CA"] { -cr-typography: hyphenation(FrenchCanada.pattern); }
*[lang^="pl"]    { -cr-typography: line_min_orphan_chars_on_the_right(2); }
*[lang^="ar"]    { -cr-typography: fallback_font(FreeSerif.ttf); }
*[lang^="zh"]    { -cr-typography: fallback_font(NotoSansCJK.ttf); }
*[lang^="zh-TW"] { -cr-typography: harfbuzz_lang_variation(tw); }
*[lang^="my"]    { -cr-typography: fallback_font(NotoSansMyanmar-Regular.ttf); }

That might be superoverkill, as I don't know how much publishers actually use lang= - and when they want good layout, they would probably just put in images, given the limitations of the various renderers out there.

But I dunno, it feels like the right generic way to go at it, and it may allow easy incremental addition/enabling/disabling of features.

(It also feels quite complex to implement, given that our hyphenation, font manager kerning method and fallback handling are global...)

yparitcher commented 4 years ago

@poire-z Thanks for the RTL/bidi support, it is great. I do a lot of Hebrew and can now use KOReader on my Kindle. I will address your questions (as a user) in https://github.com/koreader/koreader/issues/5359

Frenzie commented 4 years ago

Vaguely related: https://www.preining.info/blog/2019/10/pleasures-of-tibetan-input-and-typesetting-with-tex/

Edit: more https://developer.mozilla.org/en-US/docs/Tools/Page_Inspector/How_to/Edit_fonts

xelxebar commented 4 years ago

@poire-z Thank you for the ping. I am keeping an eye on this and am definitely still interested in getting vertical text working. At the moment, however, I am job searching, so this will have to sit on my backburner for a while.

poire-z commented 4 years ago

OK, I'm making some progress on some of that stuff.

It would request the default typography settings for the root node, and we could allow/forbid overriding it even if some lang= exists below in the document (like !important). On the frontend, we could have a Typography > menu with languages (a bit like we have for Hyphenation: the ability to set one as the default or the fallback). If set as default, we would always return the settings for that language, no matter what input crengine gives. If set as fallback, we would use the crengine input; if none, the book language from metadata; if none, that fallback; and if none, the UI language.

Would we all be OK with that? Replacing our Hyphenation > menu (which sets a global hyphenation method) with a Typography > menu that would list the same languages, to select a language as the default/fallback typographic language for the book, which could look like:

image

I have to keep the now-legacy hyphenation dict selection working (it will set a language and enable hyphenation), just to avoid the CoolReader devs having to rework their various frontends - and for us, it would work the same way with our current readerhyphenation.lua - but I have minor issues with it (left/right minimal sizes, which CoolReader is not using) that I'd need to rework. So, it would be best to rework the frontend code at the same time.

Caveats: all the languages/hyph dict names/mappings would be hardcoded into a textlang.cpp, so languages.json would no longer be used (and thus not allow customisation that easily). (I suggested above having all that configurable via CSS or via the frontend, but that's quite a lot of unrelated work, and would require even more work from the CoolReader people if they want to stay in sync with us...)

So, asking for permission :) Would that switch and that new menu be understandable to users?

The technical idea is that each text node will get associated with a TextLangCfg (from the selected language, or from the language specified in an upper node's lang= attribute), and various text rendering bits of code would use members of this object instead of global defaults:

TextLangCfg::TextLangCfg( lString16 lang_tag ) {
    printf("TextLangCfg %s created\n", UnicodeToLocal(lang_tag).c_str());
    // Keep the provided and non-lowercase'd lang_tag
    _lang_tag = lang_tag;
    // But lowercase it for our tests
    lang_tag.lowercase();

    _hyph_method = TextLangMan::getHyphMethodForLang(lang_tag);

    // https://drafts.csswg.org/css-text-3/#script-tagging
    // XXX Check for Latn, Hant, Hrkt...

    // XXX 2nd fallback font

#if USE_HARFBUZZ==1
    _hb_language = hb_language_from_string(UnicodeToLocal(_lang_tag).c_str(), -1);
#endif

#if USE_LIBUNIBREAK==1
    _lb_char_sub_func = NULL;
    if ( lang_tag.startsWith("de") ) {
        _lb_props = (struct LineBreakProperties *) lb_prop_cre_German;
    }
    else {
        _lb_props = (struct LineBreakProperties *) lb_prop_cre_Generic;
    }
    if ( lang_tag.startsWith("pl") ) {
        _lb_char_sub_func = &lb_char_sub_func_polish;
        // XXX also for pl: double real hyphen at start of next line
    }
#endif
}

and some hardcoded stuff like:

static struct {
    const char * lang_tag;
    const char * hyph_filename_prefix; // We may have both current "Italian.pattern" and old "Italian_hyphen_(Alan).pdb"
    const char * hyph_filename;
    int left_hyphen_min;
    int right_hyphen_min;
} _hyph_dict_table[] = {
    { "bg", "Bulgarian", "Bulgarian.pattern", 2, 2 },
    { "ca", "Catalan", "Catalan.pattern", 2, 2 },
    { "cs", "Czech", "Czech.pattern", 2, 2 },
    // ...table continues for the other languages...
};

NiLuJe commented 4 years ago

Sounds good to me.

Assuming the default hyphenation dicts are decent, I'm not aware of anyone actively tweaking the list.

And if they do, it'd be a PR away ;).

Frenzie commented 4 years ago

Seems fine to me too.

poire-z commented 4 years ago

Found some excellent (I think :) resource on Chinese typography, in... Chinese :/ https://www.thetype.com/kongque/ but Google Translate makes it quite alright to read in English: http://translate.google.com/translate?u=https%3A//www.thetype.com/kongque/&hl=en&langpair=auto|en&tbb=1&ie=UTF-8

For example, the article about hanging punctuation: http://translate.google.com/translate?u=https%3A//www.thetype.com/2017/11/13290/&hl=en&langpair=auto|en&tbb=1&ie=UTF-8 with many pictures of pages of books, and of how other applications do it.

And just when I thought I could fix Chinese spacing by squeezing punctuation glyphs (as they seemed to only occupy the left or right half of their square glyph box), I learned there that it depends on the language: in Chinese, a ？ is left-aligned in its square glyph box, while in Japanese, it is centered. And there is even more stuff centered (where I would need to eat a 1/4 on both sides of the box) in Traditional Chinese...

The same text, using the language-specific glyphs:

image

image

image (periods and commas are centered!)

Thanks to the TextLangMan typography stuff, we could delegate some decisions from lvtextfm.cpp to some per-language typography functions, but that means many things and combinations to test...
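
Such a per-language delegation could start as small as this sketch. The punctuation set and squeeze percentages are illustrative guesses based on the article above (Chinese ？ drawn in the left half of its box, Japanese centered), not measured font metrics:

```cpp
#include <cassert>
#include <string>

// Sketch of language-dependent punctuation squeezing: how much of the
// square glyph box a fullwidth punctuation mark can give up, and on
// which side, differs between Chinese (glyph drawn in the left half,
// so the right half can be squeezed) and Japanese (glyph centered, so
// a quarter on each side). Values are illustrative, not font metrics.
struct Squeeze {
    int left_pct;    // % of the glyph box removable on the left
    int right_pct;   // % removable on the right
};

static Squeeze punct_squeeze(const std::string &lang, char32_t c) {
    bool squeezable = (c == U'？' || c == U'！');   // tiny illustrative subset
    if (!squeezable)
        return Squeeze{0, 0};
    if (lang.compare(0, 2, "zh") == 0)
        return Squeeze{0, 50};    // left-aligned glyph: squeeze the right half
    if (lang.compare(0, 2, "ja") == 0)
        return Squeeze{25, 25};   // centered glyph: squeeze a quarter each side
    return Squeeze{0, 0};
}
```

A table like this, keyed by the TextLangCfg in effect, is the kind of thing lvtextfm.cpp could consult instead of hardcoding one behaviour for all CJK.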