koreader / crengine

This is the KOReader CREngine fork. It cross-pollinates with the official CoolReader repository at https://github.com/buggins/coolreader, in case you were looking for that one.
72 stars 45 forks source link

Stylesheets are parsed repeatedly in epub. #491

Closed bbshelper closed 2 years ago

bbshelper commented 2 years ago

Suffering from https://github.com/koreader/koreader/issues/9087, I decided to look into the code.

Here is a gprof visualization, collected from a program that opens a large epub and changes font size for 1000 times: 1

It turns out that the same CSS file included in different fragments are parsed repeatedly. I believe parsing time could be reduced by introducing a cache. However, my initial attempt failed. There seems to be some interactions between ldomDocument, LVStyleSheet and the parser which i haven't fully figured out.

It's worth noting that the impact varies by the number of fragments and CSS rules. The above book has 700+ fragments and each includes ~1500 lines of CSS. For a simpler epub with 25 fragments and ~150 lines of CSS, it spends just 6% of LVDocView::Render() time in parsing.

poire-z commented 2 years ago

It turns out that the same CSS file included in different fragments are parsed repeatedly

Indeed, each CSS linked by each DocFragment is parsed, used, and dropped. So, reparsed for each DocFragment.

The CSS "parsing" phase (that your screenshot shows) may take some time, but it's usually minimal compared to the time "applying" it will take (that is, checking each node vs: each selectors). I never got the feeling some optimisation here could be needed. I guess 700 fragments with only a few text nodes each (where parsing / applying ratio would be not small) is quite a degenerate case that I think doesn't really need bothering.

Frenzie commented 2 years ago

It turns out that the same CSS file included in different fragments are parsed repeatedly.

That's true but fairly immaterial. If you create an EPUB with enormous stylesheets and thousands of XHTML docfragments, it'd be even more problematic.

But fundamentally parsing a stylesheet is just really fast compared to actually checking if a certain style applies to a node.

bbshelper commented 2 years ago

Thanks for your replies. I realized that the huge book I used for profiling maybe unrepresentative. So I measured another 2 books, which are 500 and 700 pages in paper form. The result shows that LVImportStylesheetParser::Parse alone could take 5%~8% of total LVDocView::Render time. Considering it might be down to <1% by introducing a cache, there would be 5%+ overall gain. Not bad, isn't it?

BTW, going through the code base, few utilizes STL and >c++11 features. Would you accept a PR with these, or rather not?

poire-z commented 2 years ago

BTW, going through the code base, few utilizes STL and >c++11 features. Would you accept a PR with these, or rather not?

I appreciate you noticing and asking ! :) Thank you. Short answer: rather not.

Indeed, the original russian code does not use any std:: stuff: it has implemented its own set of data structures handling, array, hash tables, shared pointers... You can find them all in crengine/include/lv*h (LV being the initials of the original author https://github.com/buggins). The only traces of std:: is for optional stuff: some stolen image smooth scaling code from QT, and SVG support via LunaSVG whose interface signatures requires std::unique_ptr. So yes, we prefer no std:: stuff for core stuff: use the LV stuff.

For what you're looking at, you would probably add some member to ldomDocument (or the right super class) of the type LVHashTable<lString8, LVStyleSheet> or some variation LVHashTable<lString32, LVStyleSheetRef> mapping the CSS path to the compiled LVStylesheet.

As you seem not afraid looking at and hacking this code, I'd like you to read my posts in https://github.com/buggins/coolreader/pull/240 , so you:

Btw, can you describe a bit what the numbersmean in your calls stats image ? image What's the 25.81% and (0.00%), and the 1000x. Is this (mostly CoolReader specific) updateBookMarksRanges/getCurrentFileHistRecord stuff really called 1000 times when rendering ? Or did you move 1000 pages forward ? It seems KOReader would not need this (but you never know with crengine internals), so I would allow and not worry about one call of this on each page turn, but if it's 25% of the rendering time, it's obviously a candidate to optimisation !

Frenzie commented 2 years ago

Considering it might be down to <1% by introducing a cache, there would be 5%+ overall gain. Not bad, isn't it?

Could you share this stylesheet? Talking in the abstract isn't very useful imho. The simple fact is you can have absolutely atrocious stylesheets that frankly aren't worthy of our time. But if it's perfectly reasonable that's another matter.

poire-z commented 2 years ago

(In case you would have/need to create a LVStyleSheet::mergeOtherStylesheet( LVStyleSheet & other_sheet ) for experimenting, sharing some old code where I implemented something that had to do some merging - that I ended up not needing and did stuff differently. Might save you a few hours, dunno. mergeStylesheetsEarlyExperiment.diff.txt )

bbshelper commented 2 years ago

LV being the initials of the original author https://github.com/buggins

Thank you for answering a question that I wanted to ask but haven't :)

As you seem not afraid looking at and hacking this code, I'd like you to read my posts in https://github.com/buggins/coolreader/pull/240 , so you:

I really appreciate your effort in improving an old code base like this. Its incredibly hard not to break things even with small and innocent changes. While I still, from time to time, have an impulse to modernize legacy code, I'll try to make my patch minimal and in line with existing code.

Regarding the numbers, 25.81% (0.00%) and 1000x, they are, respectively:

bbshelper commented 2 years ago

Could you share this stylesheet? Talking in the abstract isn't very useful imho. The simple fact is you can have absolutely atrocious stylesheets that frankly aren't worthy of our time. But if it's perfectly reasonable that's another matter.

I'm afraid the stylesheet by itself, without the complete epub, is not very interesting, just 500 lines of normal rules. The book is copyrighted so I can't share it here.

You could try this one from Project Gutenburg. My profiling result is here 2641-2

With 150 lines CSS, it spends ~5% in parsing. I also see other books as low as 1% with 80 lines of CSS.

bbshelper commented 2 years ago

Also put my profiling code here:

auto *view = new LVDocView(-1, true);
view->setViewMode(LVDocViewMode::DVM_PAGES, -1);
view->Resize(768, 1024);
view->setPageHeaderInfo(PGHDR_AUTHOR|PGHDR_TITLE|PGHDR_PAGE_NUMBER|PGHDR_PAGE_COUNT|PGHDR_CHAPTER_MARKS|PGHDR_CLOCK);

view->LoadDocument(argv[1]);

for (int i = 0; i < 500; ++i) {
    view->setFontSize(18);
    view->Render();
    view->setFontSize(16);
    view->Render();
}
poire-z commented 2 years ago

Continued/implemented in #493.