koreader / crengine

This is the KOReader CREngine fork. It cross-pollinates with the official CoolReader repository at https://github.com/buggins/coolreader, in case you were looking for that one.

Inefficiency in libunibreak's `get_char_lb_class_lang` #494

Closed. bbshelper closed this issue 1 year ago.

bbshelper commented 2 years ago

Callgrind profiling suggests that more than 10% of total rendering time is spent in calls to get_char_lb_class_lang. The exact share of course depends on the number of characters in the book.

libunibreak uses a 2-level table when looking up the line break property in lb_prop_default. Both levels use linear search, so for most codepoints a lookup takes dozens of comparisons, which is quite suboptimal. Several thoughts here:

poire-z commented 2 years ago

libunibreak uses a 2-level table when looking up the line break property in lb_prop_default. Both levels use linear search, so for most codepoints a lookup takes dozens of comparisons, which is quite suboptimal.

From what I get from quickly looking at the libunibreak code, it builds a 70x40 table of tables. So, for ASCII/Latin/Cyrillic, it will quickly find the first-table index (one or two comparisons) and then iterate over the found 40-item subtable. For Chinese, it may have to iterate over 30 items in the first table before finding the 40-item subtable to iterate over (as for Latin). So, I guess you get this penalty more with CJK than we do with English or French? (Maybe you could hardcode some entry points into the first table: if ch >= 0x4E00, jump directly to first-table index 27 :)
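For illustration, here is roughly the shape of such a 2-level lookup, plus the hardcoded CJK entry point suggested above. The names, sizes and data below are illustrative, not libunibreak's actual generated tables.

```c
#include <stddef.h>
#include <stdint.h>

enum LineBreakClass { LBP_SP, LBP_AL, LBP_ID };

struct RangeProp  { uint32_t start, end; enum LineBreakClass prop; };
struct IndexEntry { uint32_t end; const struct RangeProp *ranges; size_t len; };

/* Second-level subtables (each up to ~40 entries in the real data). */
static const struct RangeProp latin_ranges[] = {
    { 0x0020, 0x0020, LBP_SP }, { 0x0041, 0x024F, LBP_AL },
};
static const struct RangeProp cjk_ranges[] = {
    { 0x4E00, 0x9FFF, LBP_ID },
};

/* First-level index (~70 entries in the real data). */
static const struct IndexEntry first_level[] = {
    { 0x024F, latin_ranges, 2 },
    { 0x9FFF, cjk_ranges,   1 },
};

static enum LineBreakClass lookup(uint32_t ch)
{
    size_t i = 0;
    /* The suggested shortcut: hardcode an entry point so CJK ideographs
     * skip most of the first-level scan (index 27 in the real table). */
    if (ch >= 0x4E00)
        i = 1;
    for (; i < sizeof first_level / sizeof first_level[0]; i++) {
        if (ch <= first_level[i].end) {
            const struct RangeProp *r = first_level[i].ranges;
            for (size_t j = 0; j < first_level[i].len; j++, r++)
                if (ch >= r->start && ch <= r->end)
                    return r->prop;
            break;
        }
    }
    return LBP_AL; /* default class when no range matches */
}
```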

Don't call the function on every character. I don't know whether this is doable, as I'm still new to the line-breaking algorithm. Maybe we only need the info for a few characters near the end of each line?

We do it in lvtextfm.cpp, in copyText(), when we iterate over the whole paragraph's text (before knowing where we could break a line) and set some flags. It would feel a bit convoluted doing it at a later stage, where the logic is to work only with these flags. Libunibreak (and the UAX#14 algorithm it implements) works by walking forward, keeping a little state from the chars seen before to decide whether we can break before the upcoming char. So, we can't really do it backward (and I wouldn't like hacking this algo :) making dirty optimisations like: rewind only 10 chars, assuming we'll meet a space or some other char that always resets the state machine and doesn't depend on what's before - and if not found, rewind 10 more chars :)
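For reference, a minimal sketch of that forward walk, assuming libunibreak's incremental context API (lb_init_break_context() / lb_process_next_char()); the function below and its flags array are placeholders, not crengine's actual copyText() code.

```c
#include <stddef.h>
#include "linebreak.h"     /* utf32_t, LINEBREAK_MUSTBREAK, ... */
#include "linebreakdef.h"  /* struct LineBreakContext, lb_*() */

/* Walk the paragraph forward once, recording for each position whether
 * a break is allowed after it.  The state carried in ctx is what makes
 * a backward or partial re-scan awkward. */
static void mark_break_opportunities(const utf32_t *text, size_t len,
                                     const char *lang, char *flags)
{
    struct LineBreakContext ctx;
    if (len == 0)
        return;
    lb_init_break_context(&ctx, text[0], lang);
    for (size_t i = 1; i < len; i++)
        flags[i - 1] = (char)lb_process_next_char(&ctx, text[i]);
    flags[len - 1] = LINEBREAK_MUSTBREAK; /* end of paragraph */
}
```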

Use a table for the BMP. I made an (unpolished) change. The lookup is 10x faster in my testing, which means ~10% savings in total render time. Not sure upstream will accept it, though. Also, there are space considerations on low-end devices.

I'm totally fine with this idea. No need to propose it to upstream; we don't forbid ourselves from patching upstream libraries, which we already do for libunibreak (currently just to expose some internal properties that crengine's textlang.cpp needs). So, your idea would add 65KB to libunibreak.so, right? 140KB vs 205KB wouldn't hurt much :) (and it would be in the "text"/static part of the library, so not really any more memory usage, I guess).
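A rough sketch of what that flat BMP table amounts to; the names are hypothetical and the table is left empty here (in the actual change it would be a const table generated from the Unicode data).

```c
#include <stdint.h>

/* One byte per BMP codepoint, giving O(1) lookup.  In the real change
 * this would be a const table generated from the Unicode data; it is
 * left zero-initialized here only to keep the sketch compilable.  Its
 * 64K one-byte entries are where the size increase discussed above
 * comes from. */
static uint8_t lb_prop_bmp[0x10000];

/* Stand-in for the existing range-based lookup, kept as a fallback for
 * the rarely seen supplementary planes. */
static int get_char_lb_class_ranges(uint32_t ch)
{
    (void)ch;
    return 0; /* e.g. LBP_AL in the real code */
}

static int get_char_lb_class_fast(uint32_t ch)
{
    if (ch < 0x10000)
        return lb_prop_bmp[ch];           /* single array access */
    return get_char_lb_class_ranges(ch);  /* range scan, as before */
}
```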

Looking at your diff, I see your lb_prop_supplementary has 938 items. Looking at libunibreak's src/linebreakdata.c, I see only 668 lines after 0x10000. Any idea why? (Also, even if chars > 0x10000 are hardly ever seen (except maybe for some scripts), it feels condescending to them to now do 668 comparisons instead of <40+<40 as before :) You could keep the original logic of 2-level tables (25x25) and use it only for > 0x10000, which would also limit the diff. Just mentioning it, not insisting on it :) (Your diff may look nicer if you keep your hexadecimal digits uppercase, ABCDEF, like libunibreak does.)

You could also have a look at utf8proc and harfbuzz among our thirdparty libraries: they must also use/build/generate tables for various Unicode char properties. (I see some of them, but I'm too lazy to look at how they work and how they speed things up.)

bbshelper commented 2 years ago

So, I guess you get this penalty more with CJK than we do with English or French?

Intuitively speaking, CJK has some large consecutive blocks, so the level-2 lookup should be quicker, and CJK books tend to have a lower character count than English ones. But I haven't bothered to make a finer estimate. :)

So, your idea would add 65KB to libunibreak.so, right ?

The diff is more like ~40KB, because ~2k 12-byte BMP entries in lb_prop_default are removed. My host libunibreak.so build increased from 101,512 B to 141,784 B.

I see your lb_prop_supplementary has 938 items. Looking at libunibreak's src/linebreakdata.c, I see only 668 lines after 0x10000. Any idea why?

A possible reason is that I generated it with Unicode 15.0 while upstream uses 14.0. Or maybe it's a bug in my script :p

Also, even if chars > 0x10000 are hardly ever seen (except maybe for some scripts), it feels condescending to them to now do 668 comparisons instead of <40+<40 as before

Yes. I intend to use binary search for the supplementary planes; I just haven't implemented it yet. The current patch is a PoC to see whether my optimization works for the BMP.
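For what it's worth, a binary search over a sorted {start, end, prop} range table could look roughly like this; the table contents below are a tiny illustrative subset, not real Unicode 15.0 data.

```c
#include <stddef.h>
#include <stdint.h>

struct RangeProp { uint32_t start, end; int prop; };

/* Tiny illustrative subset of the supplementary-plane ranges. */
static const struct RangeProp lb_prop_supplementary[] = {
    { 0x10000, 0x100FF, 1 },
    { 0x20000, 0x2A6DF, 2 },   /* e.g. CJK Extension B as ideographic */
};

static int lookup_supplementary(uint32_t ch)
{
    size_t lo = 0;
    size_t hi = sizeof lb_prop_supplementary / sizeof lb_prop_supplementary[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ch < lb_prop_supplementary[mid].start)
            hi = mid;
        else if (ch > lb_prop_supplementary[mid].end)
            lo = mid + 1;
        else
            return lb_prop_supplementary[mid].prop;
    }
    return 0; /* not found: default class (e.g. LBP_AL) */
}
```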

Your diff may look nicer if you keep your hexadecimal digits uppercase, ABCDEF, like libunibreak does.

I call it midnight laziness :p

You could also have a look at utf8proc and harfbuzz among our thirdparty libraries

In my simple test they don't play a significant role in rendering time. FreeType is around 4%, spread over a long list of functions. I'll definitely take a look should they rank high in future experiments.

poire-z commented 2 years ago

You could also have a look at utf8proc and harfbuzz among our thirdparty libraries

I meant: look at them only for how they handle efficient codepoint-to-Unicode-property lookups, for ideas - although there is probably no faster way of doing it than the one you're going with.

A possible reason is that I generated it with Unicode 15.0 while upstream uses 14.0.

I think we shouldn't diverge too much from upstream libunibreak. They will one day update to Unicode 15; we don't need to do it yet. And we want to keep it easy for us or our grandchildren to follow upstream libunibreak releases :) Your Python script could be called from our CMake file to generate what you need from the existing libunibreak files. And maybe, to keep the diff smaller, you could just not touch their map, and just use lb_prop_supplementary = lb_prop_default[2126] (or the correct C way of making it start later), or just put a /* and */ around the entries you don't need.
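The "correct C way" could simply be a pointer into the existing table; a sketch, assuming the struct LineBreakProperties layout from linebreakdef.h and that entry 2126 is the first one above 0x10000:

```c
#include "linebreakdef.h" /* struct LineBreakProperties, lb_prop_default */

/* Reuse the tail of the existing table as the supplementary-plane
 * table, instead of generating and shipping a second copy.  2126 is
 * the index quoted above and would need to stay in sync with the
 * generated data. */
static const struct LineBreakProperties *const lb_prop_supplementary =
    &lb_prop_default[2126];
```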

Frenzie commented 2 years ago

Your Python script could be called from our CMake file to generate what you need from the existing libunibreak files.

Hold on a second, I don't think Python is a build dependency at this time (although we do happen to include it in our Docker images).

bbshelper commented 1 year ago

I meant: look at them only for how they handle efficient codepoint-to-Unicode-property lookups, for ideas - although there is probably no faster way of doing it than the one you're going with.

Indeed. They both have some elaborate tables to store the data, especially harfbuzz: https://github.com/harfbuzz/packtab

I think we shouldn't diverge too much from upstream libunibreak.

Maybe we should try upstreaming the change.

And maybe, to keep the diff smaller, you could just not touch their map, and just use lb_prop_supplementary = lb_prop_default[2126] (or the correct C way of making it start later), or just put a /* and */ around the entries you don't need.

Nice trick.

poire-z commented 1 year ago

Done upstream, and for KOReader with the libunibreak update: https://github.com/koreader/koreader-base/pull/1563.

Quoting from after #495 and #493:

A big book (3000 items in the zip) that I have used over the years for benchmarking (and that I remember took 70 seconds to load at some point in the past) was taking 64 seconds these days. With this PR and https://github.com/koreader/crengine/pull/493, it now takes 48 seconds on my Kobo.

With the libunibreak update, I went from 48 s to 45 s.