koreader / crengine

This is the KOReader CREngine fork. It cross-pollinates with the official CoolReader repository at https://github.com/buggins/coolreader, in case you were looking for that one.

Rework and cleanup char categorization and word selection #472

Closed poire-z closed 2 years ago

poire-z commented 2 years ago

Followup to #468, hoping to now get a more consistent char categorization without duplicated code (that could contradict itself), trusting Unicode (via utf8proc).

LVString: fix char props, add lStr_isCJK() and lStr_isRTL()

Reorganize CH_PROP_* flags: remove unused ones, combine some of them to possibly free slots for the future. Fix some chars' CH_PROP, and have getCharProp() correctly set them for other ranges thanks to utf8proc categories, even if some of these flags are not used yet. (Added CH_PROP_PUNCT_OPEN/CLOSE even if not used currently; it might be useful with CJK typography https://github.com/koreader/koreader/issues/6162.)

Factorize RTL detection in lStr_isRTL(). Factorize CJK detection in lStr_isCJK(). Remove wrong defines of CJK ranges: they were mostly used in legacy code (where we hardcoded these bad ranges, not worth fixing).
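
For illustration, deriving CH_PROP_* flags from utf8proc's general categories could look roughly like this (a hedged sketch: the flag values and the function name are made up for the example, the real code lives in lvstring.cpp/lvstring.h):

#include <utf8proc.h>

// Illustrative flag values only; the real CH_PROP_* defines live in lvstring.h.
#define CH_PROP_ALPHA        0x0001
#define CH_PROP_DIGIT        0x0002
#define CH_PROP_PUNCT        0x0004
#define CH_PROP_PUNCT_OPEN   0x0008
#define CH_PROP_PUNCT_CLOSE  0x0010
#define CH_PROP_SIGN         0x0020
#define CH_PROP_SPACE        0x0040

// Sketch: map utf8proc's Unicode general category to char prop flags.
static int charPropFromUnicode(utf8proc_int32_t ch) {
    switch ( utf8proc_category(ch) ) {
        case UTF8PROC_CATEGORY_LU: case UTF8PROC_CATEGORY_LL: case UTF8PROC_CATEGORY_LT:
        case UTF8PROC_CATEGORY_LM: case UTF8PROC_CATEGORY_LO:
            return CH_PROP_ALPHA;                        // letters
        case UTF8PROC_CATEGORY_ND: case UTF8PROC_CATEGORY_NL: case UTF8PROC_CATEGORY_NO:
            return CH_PROP_DIGIT;                        // numbers
        case UTF8PROC_CATEGORY_PS:
            return CH_PROP_PUNCT | CH_PROP_PUNCT_OPEN;   // opening punctuation
        case UTF8PROC_CATEGORY_PE:
            return CH_PROP_PUNCT | CH_PROP_PUNCT_CLOSE;  // closing punctuation
        case UTF8PROC_CATEGORY_PC: case UTF8PROC_CATEGORY_PD: case UTF8PROC_CATEGORY_PI:
        case UTF8PROC_CATEGORY_PF: case UTF8PROC_CATEGORY_PO:
            return CH_PROP_PUNCT;                        // other punctuation
        case UTF8PROC_CATEGORY_SM: case UTF8PROC_CATEGORY_SC:
        case UTF8PROC_CATEGORY_SK: case UTF8PROC_CATEGORY_SO:
            return CH_PROP_SIGN;                         // symbols, currency, emoji...
        case UTF8PROC_CATEGORY_ZS: case UTF8PROC_CATEGORY_ZL: case UTF8PROC_CATEGORY_ZP:
            return CH_PROP_SPACE;                        // spaces and separators
        default:
            return 0;
    }
}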

Rework is/prev/next-VisibleWord-Start/End()

Things look more robust and simple when thinking in terms of IsWordChar / IsWordBoundary, rather than, as previously, in terms of IsWordSeparator with special handling of the CJK case. Consider CH_PROP_SIGN (symbols, currency, math, emoji...) as single-char words, just like CJK. (So that we are ready for the future when books will be written mostly with emojis, and we'll have emoji dictionaries, and the rare alpha words will have to be looked up in precious old dictionaries because nobody will read alphabetic text any longer... 📚 🙈 😰 🧠👻 )

Remove the no-longer-needed IsWordSeparator(), and the unused thisVisibleWordEnd() (which, being unused, wasn't obvious about what it should do). Rename the badly named and duplicated canWrapWordBefore()/-After() to IsStandaloneWord() (used by the otherwise untouched *Sentence* functions).

Note: the previous code was a bit convoluted in its checks. I had a hunch that working in terms of char/boundary would be better and more correct, but my initial rewrite was as convoluted as before :) and had a few bugs on edge cases. While solving them, I ended up with much less convoluted code that started to look quite symmetric/similar across the 4 prev/next-VisibleWord-start/end functions, and going along with that symmetry actually solved more edge cases! :) So, I guess my hunch was a good one.
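
A hedged sketch of that char/boundary way of thinking (helper names and flag values are simplified stand-ins, not the actual lvstring.cpp code, which also handles hyphens, soft hyphens, collapsed spaces...):

#include <cstdint>

typedef uint32_t lChar32;                 // stand-in for crengine's char type

enum { CH_PROP_ALPHA = 1, CH_PROP_DIGIT = 2, CH_PROP_SIGN = 4 };  // illustrative values
int  getCharProp(lChar32 ch);             // assumed: returns CH_PROP_* flags for a char
bool isCJKChar(lChar32 ch);               // assumed: single-char variant of the CJK test

// A char belongs to a word if it is alpha, digit, CJK, or a "sign" (symbol/currency/emoji...).
static bool isWordChar(lChar32 ch) {
    return isCJKChar(ch) || (getCharProp(ch) & (CH_PROP_ALPHA | CH_PROP_DIGIT | CH_PROP_SIGN));
}

// CJK and sign chars are standalone single-char words.
static bool isSingleCharWord(lChar32 ch) {
    return isCJKChar(ch) || (getCharProp(ch) & CH_PROP_SIGN);
}

// There is a word boundary between prev and next when their word-char status differs,
// or when both are word chars but either one forms a word on its own.
static bool isWordBoundary(lChar32 prev, lChar32 next) {
    bool p = isWordChar(prev);
    bool n = isWordChar(next);
    if ( p != n )
        return true;
    return p && (isSingleCharWord(prev) || isSingleCharWord(next));
}

The prev/next-VisibleWord-start/end functions then just scan backward/forward until they hit such a boundary, which is what makes the 4 of them end up looking symmetric.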

Will need a companion patch to cre.cpp:

@@ -1916,3 +1916,6 @@ static int getTextFromPositions(lua_State *L) {
                // But it is less OK with an initial long-press on a glyph, where we get a 50% chance of
-               // having the next char selected. Try to handle this case better:
+               // having the next char selected. Try to handle this case better.
+               // It turns out this also happens when some CJK/symbol char is near a multi-alpha word,
+               // so, when grabbing prev or next below, we use prevVisibleWordStart()/ nextVisibleWordEnd()
+               // instead of just moving offset by -1/1.
                if (r.getStart() == r.getEnd()) { // for single CJK character
@@ -1936,11 +1939,6 @@ static int getTextFromPositions(lua_State *L) {
                        if (grab_prev) {
-                               int offset = r.getStart().getOffset();
-                               if (offset > 0)
-                                       r.getStart().setOffset(offset - 1);
+                               r.getStart().prevVisibleWordStart();
                        }
                        else {
-                               int offset = r.getEnd().getOffset();
-                               if (offset < textLen)
-                                       r.getEnd().setOffset(offset + 1);
-                                               // (offset can be 0 .. textlen, textlen means after the last char)
+                               r.getEnd().nextVisibleWordEnd();
                        }

This solves the existing issue where a long-press on an alpha word among CJK text would grab the previous CJK char (see screenshot).

getCursorRect(): fix bottom point check

Use docToWindowPoint(... isRectBottom=true) from 939cfc2d0 when checking bottomRight point. (This fixes some issues with CJK/Symbol text selection on the bottom line of a page.)



poire-z commented 2 years ago

Merging, but waiting a bit for bumping, until we're sure v2022.03.1 is fine.

poire-z commented 2 years ago

Mhhhh, the first commit caused some unwelcome change... https://github.com/koreader/koreader/issues/6162?myhs=2022-05-10T15:46:40Z#issuecomment-1122566479 Previously, when splitting words, some CJK punctuation was not considered a standalone CJK char, so it was part of a single crengine "word" together with the CJK glyph before it (I don't think this was ever an explicit decision - it would also happen with an opening punctuation). Now, such punctuation gets to be standalone like all CJK chars.

And it happens that when specifying the language to use to HarfBuzz, so it can pick localised glyphs (i.e., the middle dot is different in simplified chinese vs traditional chinese vs japanese), it behaves differently: it doesn't pick the traditional chinese glyph when shaped standalone - but it does when shaped with another preceding char...

$ ./hb-shape resources/fonts/noto/NotoSansCJKsc-Regular.otf --language "zh-Hans" -u "3FDD 3002"
[[gid5894=0+1000|gid1391=1+1000]]

(screenshot)

$ ./hb-shape resources/fonts/noto/NotoSansCJKsc-Regular.otf --language "zh-Hant" -u "3FDD 3002"
[[gid5894=0+1000|gid63031=1+1000]]     so, gid63031 must be the Traditional Chinese glyph

(screenshot)

$ ./hb-shape resources/fonts/noto/NotoSansCJKsc-Regular.otf --language "zh-Hans" -u "3002"
[[gid1391=0+1000]]

$ ./hb-shape resources/fonts/noto/NotoSansCJKsc-Regular.otf --language "zh-Hant" -u "3002"
[[gid1391=0+1000]]     same glyph as with Simplified Chinese :(

I don't know if it's a font bug, or a HarfBuzz bug, or whether it's all working as it naturally should and we then need to handle it in our code. (And we can't really upgrade the font, as the GitHub versions were buggy, and we can't easily upgrade HarfBuzz because of gcc versions...)

Also happens with this version https://github.com/googlefonts/noto-cjk/blob/main/Sans/OTF/SimplifiedChinese/NotoSansCJKsc-Regular.otf
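
For reference, the same experiment can be reproduced with the HarfBuzz C API (a minimal standalone sketch, not crengine code; the font path and codepoint are the ones used above):

// g++ shape3002.cpp $(pkg-config --cflags --libs harfbuzz)
#include <hb.h>
#include <stdint.h>
#include <stdio.h>

int main() {
    hb_blob_t *blob = hb_blob_create_from_file("resources/fonts/noto/NotoSansCJKsc-Regular.otf");
    hb_face_t *face = hb_face_create(blob, 0);
    hb_font_t *font = hb_font_create(face);

    const uint32_t text[] = { 0x3002 };   // U+3002 IDEOGRAPHIC FULL STOP, standalone
    hb_buffer_t *buf = hb_buffer_create();
    hb_buffer_add_utf32(buf, text, 1, 0, 1);
    hb_buffer_set_language(buf, hb_language_from_string("zh-Hant", -1));
    // Let HarfBuzz guess direction/script, as hb-shape does when --script is not given
    hb_buffer_guess_segment_properties(buf);

    hb_shape(font, buf, NULL, 0);

    unsigned int n;
    hb_glyph_info_t *info = hb_buffer_get_glyph_infos(buf, &n);
    for (unsigned int i = 0; i < n; i++)
        printf("gid%u\n", info[i].codepoint);   // glyph id picked by the font

    hb_buffer_destroy(buf);
    hb_font_destroy(font);
    hb_face_destroy(face);
    hb_blob_destroy(blob);
    return 0;
}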

poire-z commented 2 years ago

Does not happen with other fonts and standalone chars:

$ ./hb-shape Vollkorn-Regular.ttf --language "ru" -u "0432"
[[uni0432=0+507]]
$ ./hb-shape Vollkorn-Regular.ttf --language "bg" -u "0432"
[[uni0432.loclBGR=0+511]]

and does not happen with NotoSansCJKsc (old and new) with another, normal glyph standalone (screenshot):

$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hans" -u "53CA"
[[gid11862=0+1000]]
$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hant" -u "53CA"
[[gid11863=0+1000]]
$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "ja" -u "53CA"
[[gid11861=0+1000]]

So, I guess it's not a HarfBuzz bug: it's the font that decides that when U+3002 is requested standalone, even with traditional chinese, we get the bottom-left-anchored simplified chinese version... Which means we should handle this on the crengine side and include the punctuation with its preceding/following word... :/

poire-z commented 2 years ago

Does not happen with another (non-punctuation) glyph standalone, with old and new NotoSansCJKsc:

$ ./hb-shape resources/fonts/noto/NotoSansCJKsc-Regular.otf --language "zh-Hans" -u "53CA"
[[gid12282=0+1000]]
$ ./hb-shape resources/fonts/noto/NotoSansCJKsc-Regular.otf --language "zh-Hant" -u "53CA"
[[gid12283=0+1000]]
$ ./hb-shape resources/fonts/noto/NotoSansCJKsc-Regular.otf --language "ja" -u "53CA"
[[gid12281=0+1000]]

and also not with Vollkorn.ttf and a Cyrillic char that looks different in "ru" vs "bg" when standalone.

So, probably not a Harfbuzz bug, but a decision from NotoSansCJKsc to keep showing a standalone SC dot bottom-left-anchored even when TC requested :/ Which means we should handle this on the crengine side :/ (which sucks, but is probably simpler than having to explain/negotiate with the Noto people :)

Any idea if there exists another CJK font with OTF localized glyphs ?

poire-z commented 2 years ago

(Fuck github, didn't see my 2nd post, so thought I forgot to hit "Comment" and rewrote it... and ended up with 3...)

Any idea if there exists another CJK font with OTF localized glyphs ?

poire-z commented 2 years ago

So, probably not a Harfbuzz bug, but a decision from NotoSansCJKsc to keep showing a standalone SC dot bottom-left-anchored even when TC requested :/ Which means we should handle this on the crengine side :/ (which sucks, but is probably simpler than having to explain/negotiate with the Noto people :)

But that means I now don't have the liberty to freely position/adjust glyphs and reduce/shift punctuation like I envisioned in https://github.com/koreader/koreader/issues/6162 :(

poire-z commented 2 years ago

Feels like it is a NotoSansCJK bug ...

If I download and compare with the current Traditional Chinese version: https://github.com/googlefonts/noto-cjk/blob/main/Sans/OTF/TraditionalChinese/NotoSansCJKtc-Regular.otf

$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hans" -u "3FDD 3002"
[[gid5709=0+1000|gid1398=1+1000]]
$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hant" -u "3FDD 3002"
[[gid5709=0+1000|gid63146=1+1000]]

$ ./hb-shape ./NotoSansCJKtc-Regular.otf --language "zh-Hans" -u "3FDD 3002"
[[gid5709=0+1000|gid1398=1+1000]]
$ ./hb-shape ./NotoSansCJKtc-Regular.otf --language "zh-Hant" -u "3FDD 3002"
[[gid5709=0+1000|gid63146=1+1000]]

the 2 fonts are consistent, and obey the provided language.

But they are not consistent and don't obey the provided language when shaping U+3002 standalone:

$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hans" -u "3002"
[[gid1398=0+1000]]
$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hant" -u "3002"
[[gid1398=0+1000]]
$ ./hb-shape ./NotoSansCJKtc-Regular.otf --language "zh-Hans" -u "3002"
[[gid63146=0+1000]]
$ ./hb-shape ./NotoSansCJKtc-Regular.otf --language "zh-Hant" -u "3002"
[[gid63146=0+1000]]

But they are consistent and obey the provided language when shaping U+53CA standalone:

$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hans" -u "53CA"
[[gid11862=0+1000]]
$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hant" -u "53CA"
[[gid11863=0+1000]]
$ ./hb-shape ./NotoSansCJKtc-Regular.otf --language "zh-Hans" -u "53CA"
[[gid11862=0+1000]]
$ ./hb-shape ./NotoSansCJKtc-Regular.otf --language "zh-Hant" -u "53CA"
[[gid11863=0+1000]]

poire-z commented 2 years ago

Of interest: https://raw.githubusercontent.com/adobe-fonts/source-han-sans/release/SourceHanSansReadMe.pdf I get the exact same behaviour with this SourceHanSansHWSC-Regular.otf :/

$ ./hb-shape ./SourceHanSansHWSC-Regular.otf --language "zh-Hans" -u "53CA"
[[gid11862=0+1000]]
$ ./hb-shape ./SourceHanSansHWSC-Regular.otf --language "zh-Hant" -u "53CA"
[[gid11863=0+1000]]

$ ./hb-shape ./SourceHanSansHWSC-Regular.otf --language "zh-Hans" -u "3002"
[[gid1398=0+1000]]
$ ./hb-shape ./SourceHanSansHWSC-Regular.otf --language "zh-Hant" -u "3002"
[[gid1398=0+1000]]

$ ./hb-shape ./SourceHanSansHWSC-Regular.otf --language "zh-Hans" -u "3FDD 3002"
[[gid5709=0+1000|gid1398=1+1000]]
$ ./hb-shape ./SourceHanSansHWSC-Regular.otf --language "zh-Hant" -u "3FDD 3002"
[[gid5709=0+1000|gid63146=1+1000]]

Frenzie commented 2 years ago

Ouch, that's annoying. :-/

khaledhosny commented 2 years ago

it doesn't pick the traditional chinese glyph when shaped standalone - but it does when shaped with another preceding char...

You need to pass --script=hani since U+3002 has the "Common" script and hb-shape couldn't detect what script it belongs to.

khaledhosny commented 2 years ago

hb-shape uses hb_buffer_guess_segment_properties(); if this is what your code also uses, then you need a proper Unicode script detection implementation that operates on the paragraph as a whole, otherwise isolated text segments might not get the proper script (especially when they contain, or are composed solely of, Common and Inherited code points).

poire-z commented 2 years ago

Oh, thanks ! It indeed works when giving --script "hani" :

$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hans" --script "hani" -u "3002"
[[gid1398=0+1000]]
$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hant" --script "hani" -u "3002"
[[gid63146=0+1000]]

Glad it's not a font issue, and that we can eventually solve it on our side :)

hb-shape uses hb_buffer_guess_segment_properties(); if this is what your code also uses

We indeed do: https://github.com/koreader/crengine/blob/de482f93541f50509ba1d76fd3145ccdaf37e60a/crengine/src/lvfntman.cpp#L2475-L2479

then you need a proper Unicode script detection implementation that operates on the paragraph as a whole, otherwise isolated text segments might not get the proper script (especially when they contain, or are composed solely of, Common and Inherited code points).

We get our language from the HTML lang= attribute (which we pass as-is to HarfBuzz), so we may get multiple scripts/languages in a paragraph. In a first step, to measure widths before line breaking, we give a full text node to HarfBuzz, so it gets a chance to correctly detect the script. Then, we cut lines and "words" and position them (so, each CJK char is a "word" that we can position precisely for text justification along the line). In the later drawing step (quite decorrelated from the first step...), we only handle "words", and give each of them to HarfBuzz - so HarfBuzz will then get each CJK char with the correct language, but without more surrounding chars to correctly guess the script, as you say :/

But couldn't HarfBuzz guess the script from a language? I guess lang="zh" is ambiguous, but lang="zh-Hans" is explicit, and the script is "hani" and not "kana", isn't it? But we can indeed get just lang="zh" from the source text, so we need to do something. We could carry the script HB guessed in my first step above and associate it with each "word" we made (so HB doesn't have to guess it in the drawing step) - but this looks complicated/expensive (or maybe not, we may already be guessing the script in the word-making phase: https://github.com/koreader/crengine/blob/de482f93541f50509ba1d76fd3145ccdaf37e60a/crengine/src/lvtextfm.cpp#L3096-L3106).

Or give HarfBuzz more context - more text from the text node around the (single) glyph we may draw (I think HB allows that) - but that feels a bit excessive/convoluted (for our code). And how much context (the full text node may be large)?

Or, as the issue (it seems) only shows up when we get/draw a word made from a single CJK char/glyph (for most other languages/scripts, we make "words" with multiple chars, possibly including leading/trailing punctuation, which provides more context), we could simply derive the script from lang= (zh, ja, ko) with a hardcoded mapping or a little bit of code (in our textlang.cpp, which looks perfect for that kind of trick). If it's CJK and lang=zh*, script=hani. If lang=ja, script=kana...
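
A hedged sketch of that last option (hypothetical helper, not actual textlang.cpp code): a hardcoded lang= to default-script mapping, used only as a fallback when HarfBuzz couldn't guess a script itself:

#include <hb.h>
#include <string.h>

// Illustrative mapping from an HTML lang= tag to a default HarfBuzz script,
// to be consulted only when script guessing failed (single CJK punctuation, etc.).
static hb_script_t defaultScriptForLang(const char *lang) {
    if ( !lang )
        return HB_SCRIPT_INVALID;
    if ( strncmp(lang, "zh", 2) == 0 )
        return HB_SCRIPT_HAN;        // zh, zh-Hans, zh-Hant, zh-TW... => "hani"
    if ( strncmp(lang, "ja", 2) == 0 )
        return HB_SCRIPT_KATAKANA;   // => "kana", as suggested above
    if ( strncmp(lang, "ko", 2) == 0 )
        return HB_SCRIPT_HANGUL;
    return HB_SCRIPT_INVALID;        // let the generic fallback apply
}

This would only be checked when hb_buffer_get_script() still reports HB_SCRIPT_INVALID after hb_buffer_guess_segment_properties(), before any more generic fallback.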

poire-z commented 2 years ago

I'm a bit confused about the necessity to provide a script. Or rather, why the font would give out the correct locl/lang glyph for the specified language only when a specific script is set, and not by default.

Working script tags:

$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hant" --script "hani" -u "3002"
[[gid63146=0+1000]]
$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hant" --script "latn" -u "3002"
[[gid63146=0+1000]]
$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hant" --script "grek" -u "3002"
[[gid63146=0+1000]]
$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hant" --script "kana" -u "3002"
[[gid63146=0+1000]]

Not working script tags:

$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hant" --script "arab" -u "3002"
[[gid1398=0+1000]]
$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hant" --script "geor" -u "3002"
[[gid1398=0+1000]]
$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hant" --script "blah" -u "3002"
[[gid1398=0+1000]]
$ ./hb-shape ./NotoSansCJKsc-Regular.otf --language "zh-Hant" --script "zyyy" -u "3002"
[[gid1398=0+1000]]

I understand the script helps HB choosing the right shaper/code - so is it a badly chosen shaper that makes the font not spit out the expected glyph?

Or is it something with OpenType, as I read at https://simoncozens.github.io/fonts-and-layout/features.html :

Inside the font, the GSUB and GPOS tables are arranged first by script, then by language, and finally by feature

so, maybe to reach the language tables' locl stuff, you need to first look up a script? Isn't there a "default" script/road? And if yes, why would NotoSansCJK not make it all work with a default script?

Whose fault is it? (if not mine :)

poire-z commented 2 years ago

Did a few tests, logging sequence of chars that end up having HB_SCRIPT_INVALID. They are rare.

When using the various OpenType *num features, they seem to work in spite of the script being INVALID:

HB_SCRIPT_INVALID input len=6 31 31 31 36 34 30
HB_SCRIPT_INVALID input len=1 20
HB_SCRIPT_INVALID input len=6 31 31 31 36 34 30
HB_SCRIPT_INVALID input len=1 20
HB_SCRIPT_INVALID input len=6 31 31 31 36 34 30
HB_SCRIPT_INVALID input len=1 20
HB_SCRIPT_INVALID input len=6 31 31 31 36 34 30
HB_SCRIPT_INVALID input len=1 20
HB_SCRIPT_INVALID input len=6 31 31 31 36 34 30
HB_SCRIPT_INVALID input len=1 20
HB_SCRIPT_INVALID input len=6 31 31 31 36 34 30
HB_SCRIPT_INVALID input len=1 20
HB_SCRIPT_INVALID input len=6 31 31 31 36 34 30

(screenshot)

So, it's not that HB_SCRIPT_INVALID makes nothing work: some features like tnum and onum work. But indeed, the locl / language stuff doesn't work (and I don't really understand why it shouldn't).

Also, about one of my ideas above - remembering and reusing the script previously detected in a paragraph: with this Korean sample, the paragraph starts with HB_SCRIPT_INVALID input len=5 2018 34 b7 31 20, the "word" in red here (screenshot). We don't have a script yet if this is at the start of a paragraph. We could/should then use the script detected for the later chars. But then, because of possible BiDi reordering, we'd need a map of the scripts detected for each char, so we can reorder it too and get the script of the chars that come visually after, to shape the thing in red that will still be drawn visually first. That's quite complicated and tedious for something that will happen rarely...

I thought of just using this, which is enough to solve my issues with Traditional Chinese punctuations.

             if ( lang_cfg ) {
                 hb_buffer_set_language(_hb_buffer, lang_cfg->getHBLanguage());
             }
             // Let HB guess what's not been set (script, direction, language)
             hb_buffer_guess_segment_properties(_hb_buffer);
+            // In case HB couldn't guess a script from the unicode chars we added to its buffer,
+            // (which can happen when we give it a single CJK punctuation which would be considered
+            // as script COMMON, or a sequence of digits and punctuations), make sure we have HB aware
+            // of some valid script, so that at least the 'locl' feature works and is able to provide
+            // glyphs for the requested hb_language.
+            if ( hb_buffer_get_script(_hb_buffer) == HB_SCRIPT_INVALID ) {
+                printf("HB_SCRIPT_INVALID input len=%d", len); for (int i = 0; i < len; i++)  printf(" %x", text[i]); printf("\n");
+                // Provide "latn", which has the best chance to be known by the font and have
+                // its OpenType features working.
+                hb_buffer_set_script(_hb_buffer, HB_SCRIPT_LATIN);
+            }
+            // printf("guessed script %d (%d %d %d %d | %d %d %d %d)\n", seg_props.script,
+            //     HB_SCRIPT_INVALID, HB_SCRIPT_COMMON, HB_SCRIPT_INHERITED, HB_SCRIPT_UNKNOWN,
+            //     HB_SCRIPT_LATIN, HB_SCRIPT_HAN, HB_SCRIPT_HIRAGANA, HB_SCRIPT_KATAKANA);

@khaledhosny : do you foresee that this is really a bad idea ? Or is it a good enough solution ? (we don't really target "perfect", just "good enough" :) Are my comments in this code snippet correct (ie. "Provide "latn", which has the best chance to be known by the font and have its OpenType features working." , or am I just out of my league and inventing an alternate story about OpenType & HarfBuzz ? :)

khaledhosny commented 2 years ago

I'm a bit confused about the necessity to provide a script.

OpenType features are organized by script and language; some fonts also add the features to the DFLT script, which can be used as a fallback when the requested script does not exist in the font, but the Noto CJK fonts don't do this for the locl feature, which is understandable since the font supports more than one script and you want to enable the localized substitutions for a certain script/language pair (suppose text written in Latin script, say in pinyin, was tagged with the Chinese language: you wouldn't want the same localized forms of the punctuation as in Chinese written in CJK).

khaledhosny commented 2 years ago

do you foresee that this is really a bad idea

I'm afraid the code as it stands does not support mixed scripts at all; if it appears to be working, it is by sheer luck. Suppose the text is mixed Hebrew and Arabic and it happens to start with the Hebrew text: the guessed script will be Hebrew even for the Arabic parts, and they will be broken (and vice versa, but it will be more obvious/severe for Arabic).

poire-z commented 2 years ago

I'm afraid the code as it stands does not support mixed scripts at all; if it appears to be working, it is by sheer luck.

Rather by hard work :)

Suppose the text is mixed Hebrew and Arabic and it happens to start with the Hebrew text: the guessed script will be Hebrew even for the Arabic parts, and they will be broken (and vice versa, but it will be more obvious/severe for Arabic).

We do handle that well - it's the 2 steps I outlined above that may make our implementation a bit unconventional. In the first step I outlined above, we split a long text node into "runs" (to shape each of them individually) on these conditions: https://github.com/koreader/crengine/blob/de482f93541f50509ba1d76fd3145ccdaf37e60a/crengine/src/lvtextfm.cpp#L1840-L1851 so this step is hopefully right :) For each of these runs, we make "words" - small standalone units of text whose x position we can shift along the line for text justification - and we later shape and draw each of them individually. It's at this drawing step that we get short segments (from a same script/language/font run of the first step) for which HarfBuzz may then not be able to guess the script.

Firefox: (screenshot)

KOReader: (screenshot)

OpenType features are organized by script and language; some fonts also add the features to the DFLT script, which can be used as a fallback when the requested script does not exist in the font, but the Noto CJK fonts don't do this for the locl feature,

OK, thank you, that confirms what I had guessed.

which is understandable since the font supports more than one script and you want to enable the localized substitutions for a certain script/language pair (suppose text written in Latin script, say in pinyin, was tagged with the Chinese language: you wouldn't want the same localized forms of the punctuation as in Chinese written in CJK).

But actually, Noto Sans CJK does enable the localized forms when I set script=latn or script=grek (as shown at https://github.com/koreader/crengine/pull/472#issuecomment-1123030891). So, my feeling is that it just "forgot" to add them to the DFLT script, and that it might be better (for the application, if not for HarfBuzz, if it wants to stay "pure") to fall back to forcing script=LATIN when the script couldn't be detected (mostly for small runs of digits, symbols, punctuation and emoji, I guess), as I guess all fonts (right?) have some support for this LATIN script, and as seen with NotoSansCJK, we may get better luck at activating features when using LATIN instead of DFLT.

khaledhosny commented 2 years ago

So I think you need to carry over the script and direction from the segmentation step and do away with segment property guessing.

The script itemization step needs to resolve the Common and Inherited script properties (if it doesn't already) based on their surroundings; there is no well-defined algorithm for this, but you can check what libraqm and Pango do for ideas.
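
For reference, a minimal sketch of such a resolution pass (just the common idea, not actual libraqm/Pango code; paired-bracket handling and the like are omitted): assign each char its Unicode script, then let Common/Inherited chars take the script of the nearest preceding real script, falling back to the following one at the start of the paragraph:

#include <hb.h>
#include <stdint.h>
#include <vector>

// Resolve per-char scripts over a whole paragraph before cutting it into runs/words.
static std::vector<hb_script_t> itemizeScripts(const uint32_t *text, size_t len) {
    hb_unicode_funcs_t *ufuncs = hb_unicode_funcs_get_default();
    std::vector<hb_script_t> scripts(len, HB_SCRIPT_INVALID);
    hb_script_t last_real = HB_SCRIPT_INVALID;
    for (size_t i = 0; i < len; i++) {
        hb_script_t s = hb_unicode_script(ufuncs, text[i]);
        if (s == HB_SCRIPT_COMMON || s == HB_SCRIPT_INHERITED || s == HB_SCRIPT_UNKNOWN)
            s = last_real;               // inherit from the preceding real script
        else
            last_real = s;
        scripts[i] = s;
    }
    // Leading chars that had nothing to inherit from: take the first real script after them.
    hb_script_t next_real = HB_SCRIPT_INVALID;
    for (size_t i = len; i-- > 0; ) {
        if (scripts[i] == HB_SCRIPT_INVALID)
            scripts[i] = next_real;
        else
            next_real = scripts[i];
    }
    return scripts;
}

Each segment with an identical resolved script (within a same font/language/direction run) could then be shaped with hb_buffer_set_script() instead of relying on hb_buffer_guess_segment_properties(), and the per-char script carried along when cutting "words".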