koreader / crengine

This is the KOReader CREngine fork. It cross-pollinates with the official CoolReader repository at https://github.com/buggins/coolreader, in case you were looking for that one.

Improve css class matching #545

Closed bbshelper closed 9 months ago

bbshelper commented 10 months ago

This is my first PR addressing https://github.com/koreader/crengine/issues/529

I've tested loading and rendering, but I have no knowledge of the cache mechanism yet, so I'm not sure it works there. I'm publishing this for review now because I want to avoid conflicts with subsequent commits :)

Tested with two sets of books from Project Gutenberg; the cycle estimations for LoadDocument() reported by callgrind are as follows (lower is better):

| Percentile | Largest Books | Median Books |
| --- | --- | --- |
| Best 10% | -39.2% | -11.3% |
| Median | -6.0% | -3.3% |
| Worst 10% | -1.6% | -0.2% |


poire-z commented 10 months ago

It's a bit hard to get into with the existing names, and your new getAttribute/Value/Index ones, so more explicit naming may make it easier to follow. But I get the idea, and have nothing against it.

I've tested loading and rendering, but I have no knowledge of the cache mechanism yet, so I'm not sure it works there.

It will probably fail with a non-small book, after you open and exit it (so a cache is created), and you re-open it (so, loading the DOM and attribute/value indexes from the cache) and re-render it: your _classAttributeTable will be empty, and some CSS may then not match (or take the slower class checking path).

You would need to either:

A. rebuild it when it is about to be needed (by iterating all the nodes, or all the existing class attribute values (which are cached), if that is possible (dunno)), or
B. serialize/deserialize it to/from the cache.

We do A for _urlImageMap and lists (for the get/setNodeNumberingProps() methods). We do B for most of the other id/name/attribute/value stuff (see lxmlDocBase::serializeMaps() / ::deserializeMaps()).

You may have a hard time de/serializing LVHashTable<lUInt32, LVArray<lUInt32> *, true> (dunno). Maybe storing your LVArray<lUInt32> as a lString32 could make it simpler, using lString32::pos() to find values instead of iterating the LVArray<lUInt32>. I guess B would only store things for nodes with class="multiple values", and nothing when class="singlevalue" - so it shouldn't take much space in the cache or in RAM when deserialized. A would be better if 1) the table would take room in RAM while we only need it when re-rendering, and 2) it's fast to compute.

(Thanks for not having me read std::hash(std::array) :)

About #529: looks like this is not one of the three ideas you mentioned? Any thought about my comment in there? Especially about hashing each classname, to get a lUInt32: no need to cache anything. In stylesheet.apply(), for each node, we would compute the array of lUInt32 at the start (like you do when you fetch them from _classAttributeTable), and when building the stylesheet, you would just save a lUInt32 in LVCssSelectorRule._hashed_value instead of your _index?
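A minimal sketch of that idea, using standard C++ types and an FNV-1a hash standing in for lString32::getHash() (all names here are illustrative, not crengine's):

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hash each class name to a 32-bit value. A selector rule stores one hash
// (instead of an index into a shared table); at apply() time the node's
// class attribute is split once into an array of hashes, and matching
// becomes an integer comparison.
static uint32_t hashClassName(const std::string& s) {
    uint32_t h = 2166136261u;              // FNV-1a, standing in for getHash()
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
}

static std::vector<uint32_t> hashClassAttribute(const std::string& classAttr) {
    std::vector<uint32_t> hashes;
    std::istringstream iss(classAttr);
    std::string name;
    while (iss >> name)                    // split on whitespace
        hashes.push_back(hashClassName(name));
    return hashes;
}

static bool ruleMatches(uint32_t ruleHash, const std::vector<uint32_t>& nodeHashes) {
    for (uint32_t h : nodeHashes)
        if (h == ruleHash)
            return true;
    return false;
}
```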

bbshelper commented 10 months ago

Thanks for the insights on the cache part. It will take some time before I can submit a working patch.

(Thanks for not having me read std::hash(std::array) :)

That's no accident, my dear friend :)

About #529: looks like this is not one of the three ideas you mentioned?

This is actually a scaled-back version of my first idea. I experimented with hashing, and also with ID selectors, but haven't got a good result (yet).

Any thought about my comment in there?

Apologies for not replying further. It seemed I hadn't made myself clear, so I'd rather talk in code :p

Especially about hashing each classname, to get a lUInt32: no need to cache anything. In stylesheet.apply(), for each node, we would compute the array of lUInt32 at the start (like you do when you fetch them from _classAttributeTable), and when building the stylesheet, you would just save a lUInt32 in LVCssSelectorRule._hashed_value instead of your _index?

I didn't fully get your idea in #529 until now :) With lString32::getHash(), there might be collisions. For something like my quickCheck() that's fine, but with the uniqueness guaranteed by _classAttributeTable I could possibly also remove the recheck in LVCssSelectorRule::check(). On the other hand, your approach saves the time, space and complexity of _classAttributeTable, so it could be a net gain over my implementation. I'll need more experiments and profiling though.
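One way to keep collisions harmless is to leave a string recheck behind the hash test: the 32-bit hash filters out non-matching candidates cheaply, and the exact string comparison only runs on a hash match. A sketch with standard types (not crengine's actual check() logic):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Each node class is kept as a (hash, name) pair. A rare hash collision
// cannot produce a false positive, because the string comparison confirms
// the match; it only runs when the cheap hash test passes.
static bool classMatches(uint32_t ruleHash, const std::string& ruleClass,
                         const std::vector<std::pair<uint32_t, std::string>>& nodeClasses) {
    for (const auto& c : nodeClasses) {
        if (c.first != ruleHash)
            continue;                    // hash differs: cannot be this class
        if (c.second == ruleClass)
            return true;                 // hash matched: confirm by string
    }
    return false;
}
```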

NiLuJe commented 10 months ago

IIRC, MuPDF uses perfect hashing via gperf for those ;).

poire-z commented 10 months ago

^ You mentioned that previously in #348. We also ship a crengine/include/xxhash.h, used only in the following, which looks like what would be needed for our topic: https://github.com/koreader/crengine/blob/156d095847b608a2e516f1d84fbb9746b7ac791f/crengine/src/lvtinydom.cpp#L382-L385 - we could reuse it, I guess, if we're not sure about lString8::getHash() and collisions.

bbshelper commented 10 months ago

IIRC, MuPDF uses perfect hashing via gperf for those ;).

MuPDF uses that for their CSS properties, which are known at compile time. In our case, the class names are dynamic, and I'm not sure gperf supports that.

bbshelper commented 10 months ago

So I followed @poire-z 's input and implemented a version with lString32's own hash. It's slightly slower (typically less than 1% for LoadDocument()) but much simpler. I'd like to go with this version.

poire-z commented 10 months ago

It's slightly slower (typically less than 1% for LoadDocument())

Any idea why? It feels like computing these hashes would be less intensive than fetching things from your previous _classAttrValue LVHashTable, which needed some allocations on creation and iteration when looking things up.

but much simpler. I'd like to go with this version.

Sure, I like it better for this reason.

bbshelper commented 10 months ago

It's slightly slower (typically less than 1% for LoadDocument())

Any idea why? It feels like computing these hashes would be less intensive than fetching things from your previous _classAttrValue LVHashTable, which needed some allocations on creation and iteration when looking things up.

Well, to answer your question, I re-ran the profiling and found that, for the whole load and render process, the new version is actually 0.18% faster (median) than the old one, and only 9% of the books are slower. The old profiling data got overwritten, so I can't really tell what went wrong earlier.

Memory allocation changes can ripple through the whole program under callgrind profiling. I did some manual comparison but didn't reach any conclusion. However, when a node has only a single class name, the old version doesn't calculate the attribute hash or allocate an LVArray. This might help explain the few cases where the new version is slower.
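That single-class fast path could look roughly like this (a sketch with std::string, not crengine's actual code):

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the fast path: when the class attribute holds a
// single name (no whitespace), compare it directly and skip both hashing
// and array allocation; only multi-valued attributes pay for the split.
static bool nodeHasClass(const std::string& classAttr, const std::string& wanted) {
    if (classAttr.find(' ') == std::string::npos)
        return classAttr == wanted;            // single class: one comparison
    // multi-valued: scan space-separated names
    size_t pos = 0;
    while (pos < classAttr.size()) {
        size_t end = classAttr.find(' ', pos);
        if (end == std::string::npos)
            end = classAttr.size();
        if (classAttr.compare(pos, end - pos, wanted) == 0)
            return true;
        pos = end + 1;
    }
    return false;
}
```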

poire-z commented 10 months ago

but much simpler. I'd like to go with this version.

I'm now realizing that this simple solution won't allow for more similar optimisations :/

We would only be optimising the class check for the current node being checked, if it has a class= attribute, so only .foo or div.bar, because of the work that you only do at the entry point (around the _selector_0 we were talking about). We won't be able to reuse much of your code (your initial idea or the simpler follow-up) if we want to optimise div.container p (which is really common in most books with large stylesheets I've seen): that goes on to check other LVCssSelectorRules of the same LVCssSelector, but against parent nodes of the initial node, i.e. in all the cssrt_parent, cssrt_ancessor... cases.

If we want to optimize that too, we'd need to move the quick class hash checking in there: https://github.com/koreader/crengine/blob/22191fda2f1d5c8f3c63fedb6e414bb74e0e9a36/crengine/src/lvstsheet.cpp#L5166-L5186 actually replacing the string comparison there with a "hash present in an array of hashes" test (so, no need for a rarely used quickCheck() - just a quick regular check() :)
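A sketch of what that ancestor check could look like with hashed classes (Node is a stand-in for ldomNode; not the actual crengine code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the cssrt_ancessor case with hashed classes: walk
// up the parent chain and test "rule hash present in the ancestor's class
// hash array" instead of comparing strings at every level.
struct Node {
    const Node* parent = nullptr;
    std::vector<uint32_t> classHashes;   // computed once per style pass
};

static bool ancestorHasClassHash(const Node* node, uint32_t ruleHash) {
    for (const Node* p = node->parent; p; p = p->parent)
        for (uint32_t h : p->classHashes)
            if (h == ruleHash)
                return true;             // matched on some ancestor
    return false;
}
```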

So, maybe some ideas from your initial implementation could be reused or tweaked for that.

What do you think? It shouldn't be too hard to implement and try - and I guess that if you already got such % speedup numbers by only doing the optimisation for the tail end, you'll get even more impressive ones if you can get it done for all .class checks on ancestors.

bbshelper commented 10 months ago

I've been experimenting with the cache thing for a while. It is actually more involved than I expected. The patch is still far from reviewable, but I can share my two cents now since you asked:

  1. Selectors like div.foo p are quite common and cannot be optimized by this commit.

  2. Other than class selectors, there are excessive calls to some ldomNode methods, such as getNodeId(), likely caused by ancestor rules (other combinators may contribute too, but ancestor rules are more common). They are not slow per se (if not persisted), but can accumulate to several % of total load and render time.

My idea is to introduce a cache class (FatNode) which behaves just like ldomNode but caches some results, like getNodeId() or isRoot().

In both LoadDocument() and render(), we apply styles in a depth-first way, so it's easy to maintain a list of FatNodes from the root node to the current one (AncestorCache). With this cache, methods like FatNode::getParentNode() can simply return a FatNode *. This should help a lot with ancestor rules.

Just as you said, the cache doesn't need to be stored in the document cache. It is constructed on the fly during style calculation and discarded afterwards.

I don't think the current commit could interfere with this design. We just need to cache the hash values in FatNode.
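A rough sketch of the FatNode / AncestorCache shape described above (all names and fields are illustrative assumptions, not crengine code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: the depth-first style pass keeps a stack of FatNodes
// from the root down to the current node, each caching values (node id,
// class hashes, ...) that ancestor rules would otherwise recompute through
// ldomNode calls.
struct FatNode {
    uint16_t nodeId;                     // cached getNodeId() result
    std::vector<uint32_t> classHashes;   // cached class name hashes
};

struct AncestorCache {
    std::vector<FatNode> stack;          // stack[0] is the root

    void enter(uint16_t nodeId) { stack.push_back({nodeId, {}}); }
    void leave() { stack.pop_back(); }

    // the current node's parent is just the previous stack entry:
    // no tree walk needed
    const FatNode* parent() const {
        return stack.size() < 2 ? nullptr : &stack[stack.size() - 2];
    }
};
```

The cache lives only for the duration of the style pass, matching the "constructed on the fly, discarded afterwards" point above.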

Implementation challenges:

  1. Sibling combinators: FatNode::getUnboxedPrevSibling() can only return an ldomNode, as its siblings are not in the AncestorCache (and probably shouldn't be). This makes FatNode not a drop-in replacement for ldomNode in style calculation, so there would need to be two versions of most functions in lvstsheet.cpp. Templates help, but still.

  2. The call stack is complex: between LoadDocument()/render() (which should own the AncestorCache) and LVStyleSheet::apply(), there are initNodeStyleRecursive(), updateStyleDataRecursive(), initNodeStyle(), setNodeStyle(), to name a few. The AncestorCache has to be passed around and maintained along the way: a lot of work, and not very elegant. Also, initNodeStyle() has more callers than I thought, even odx_fb2TitleHandler::makeSection(), which I know nothing about.

  3. For small books, or books with really simple CSS, the added complexity could make things slower (but they should already be fast to open and render, so the user won't notice anyway).

poire-z commented 10 months ago

likely caused by ancestor rules (other combinators may contribute too, but ancestor rules are more common). They are not slow per se (if not persisted), but can accumulate to several % to total load and render time.

Well, it's expected that they are slow, given the multiple parent paths we need to check.

I can somehow follow the rest of your ideas - but I don't really want to get much involved in these other complex node & style topics for now :) It's already complex, and I'm not sure I'd want more kludge unless it is elegant and not too intrusive.

I was just focused on the scope of this PR, avoiding string comparisons in classname checks. And it feels like the idea I suggested above could be really unintrusive, leaving as much stuff as possible in lvstsheet.cpp, hopefully just adding a method in lvtinydom.cpp.

bbshelper commented 10 months ago

Well.. misunderstanding again.. maybe I was too preoccupied by the whole FatNode thing :p

I'd say your idea makes sense. Actually I had a failed attempt to completely remove the class string check. I'll do some experiments and come back to you.

bbshelper commented 9 months ago

Tweaking my initial implementation didn't deliver a better result than the simpler approach: it's still slower with my test set. So please consider merging my latest commit. Applying the quick check to selector_id gives a measurable speedup.

poire-z commented 9 months ago

So please consider merging my latest commit.

I did some poor-man timing checks on 2 documents (one with some ancestor class selectors, timing around 1.6s - the other without much CSS but many nodes, timing around 10s), and I do observe (sometimes, not always, I don't often get consistent timings... dunno why) some noticeable speedup, so I could be fine with that.

Tweaking my initial implementation didn't deliver a better result than the simpler approach. It's still slower with my test set.

I'm not sure you got what I had in mind, and if you ended up testing what I imagined, so I went coding it, picking stuff from your initial implementation. Here's my changes over the current state of this PR: pr545_classHashes.diff.txt

Somehow, I don't notice the speedup I got with your current PR :/ I'm not sure why. On one document, I get 83 misses (so classhashes computed and put in our LVHashTable of size 128), and 3,011,599 hits (so, classhashes fetched from that cache). The checks & fetches shouldn't take many loops if we have 83 items in the 128-modulo-hashes slots...

Can you check how this patch does with your benchmarking tools? And if you don't observe any speedup, any idea why it brings no efficiency, and where the bottleneck is?

bbshelper commented 9 months ago

I ran your patch through my test epubs. It was slower. I've attached the profiling result of one epub here: pr545.tgz. It's chosen because it has the median performance change among the 100 epubs. You may check it with KCacheGrind.

  1. It's probably not a good idea to remove quickClassCheck(). It can eliminate 90% of the calls to LVCssSelector::check(), and it is much faster: there are several checks before the switch statement inside LVCssSelectorRule::check(), which take time.
  2. The change inside the cssrt_class case doesn't look like a clear win over plain string comparison. getClassHashes() already takes 69 cycles on average.
  3. Different strings can have the same hash value (collision). I don't think it's functionally correct to return results based on hash comparison alone, though I don't see any regressions in my test epubs.

I can share with you my experiment with my initial implementation (and you're right: it's indeed not what you had in mind): https://github.com/bbshelper/crengine/commit/e270699743e206f950657999021748b7eaf8c4b3. Unfortunately it's also slightly slower than the current PR commit.

I have also prototyped the FatNode thing, and with that cache in place, the above commit is only slightly faster. So I decided not to pursue it further at this moment.

poire-z commented 9 months ago

OK, thanks for your tests and thoughts. So, let's go with that, and call it pause :)