koreader / crengine

This is the KOReader CREngine fork. It cross-pollinates with the official CoolReader repository at https://github.com/buggins/coolreader, in case you were looking for that one.
70 stars 45 forks source link

Optimize `LVHTMLParser::CheckFormat()` #537

Closed bbshelper closed 10 months ago

bbshelper commented 10 months ago

This is a series of commits superseding https://github.com/koreader/crengine/pull/497

Current implementation works as follows:

  1. run AutodetectEncoding(). It detects BOM, runs heuristics on 128K bytes if not found.
  2. use the previous result to convert input bytes to char32_t
  3. scan for tags. if it's (x)html then look further for encoding information. If charset/encoding is found in html, it prevails.

The problem with this approach is:

The new implementation is (very) loosely based on the HTML standard.

Tests:


This change is Reviewable

poire-z commented 10 months ago

Haven't yet read the whole thing, but about use std::string_view to avoid string copy:

There was originally no use of std:: in crengine - the original author implemented its own string library, defined in crengine/include/lvstring.h. I'd like if we keep that spirit and not introduce std:: stuff when not strictly needed. I dunno if it would change libcrengine.so file size or dependancies, but at least, it keeps the user reading that code from having to look at 2 documentations (lvstring.h, and the std:: documentation).

I don't think the place where you introduced it is really a hot path that would need optimisation, so you could just drop them - but otherwise, you could/should implement bits of the idea of this std::string_view in our lvstring.h/cpp.

(The only occurences of std:: in the current code is some std::cerr leftovers in crengine/qimagescale/qimagescale.cpp - and my std::unique_ptr<lunasvg::Document> lunasvg_doc = nullptr; because the external API requires them and I didn't manage to find another way to do it :/.)

NiLuJe commented 10 months ago

crengine/qimagescale/qimagescale.cpp

That may also go away soon; stb_image_resize was recently updated, and it looks pretty interesting both in terms of performance and API, so I may move to that if I ever find the time to test this ;).

bbshelper commented 10 months ago

There was originally no use of std:: in crengine - the original author implemented its own string library, defined in crengine/include/lvstring.h.

Yes. lString32 is basically std::shared_ptr<std::u32string>, and from a modern view, the shared_ptr part is a misuse. Not criticism though, it was a popular memory management technique back then.

I'd like if we keep that spirit and not introduce std:: stuff when not strictly needed.

I already refrained from introducing STL stuff. To be frank, crengine's offering is painful to use (LVHashTable in particular), and makes contribution harder than it should be.

I dunno if it would change libcrengine.so file size or dependancies

In my build, its size increased from 27,753,896 to 27,787,520 (0.12%)

but at least, it keeps the user reading that code from having to look at 2 documentations (lvstring.h, and the std:: documentation).

It's because they are different, and there is no equivalence in lvstring.h.

I don't think the place where you introduced it is really a hot path that would need optimisation, so you could just drop them

I didn't think so either, until profiling shows otherwise. The patch could improve LoadDocument() by 1%. And the patch is simple, it's just some mechanical changes.

but otherwise, you could/should implement bits of the idea of this std::string_view in our lvstring.h/cpp.

Please don't. This would just be a waste and error-prone. There's simply no point in replicating the STL stuff.

my std::unique_ptr lunasvg_doc = nullptr; because the external API requires them and I didn't manage to find another way to do it :/.)

You probably shouldn't even try that. :p std::unique_ptr clearly marks the ownership of resources and is handy for correct memory management. Its use should be preferred over new/delete.

To be clear, while I see benefits in adopting STL in crengine, I'm not an STL evangelist. I'm not arguing for a rewrite of existing code, if it works well, but I do hope you can be more open to modern C++ stuff in new code. The standard evolves for a reason. Working with legacy code is hard and I appreciate your effort, but please don't force me to add more technical debt to the code base.

NiLuJe commented 10 months ago

FWIW, I heartily agree with your conclusion ;).

poire-z commented 10 months ago

FWIW, I heartily agree with your conclusion ;).

Which part ? That we should allow people adding new technical debt to the code base ?

To be frank, crengine's offering is painful to use (LVHashTable in particular), and makes contribution harder than it should be. Working with legacy code is hard and I appreciate your effort, but please don't force me to add more technical debt to the code base.

I understand your frustration. But I'm not experienced enough to feel/live it. The point is that I'm the only person to have a feel of the whole crengine stuff (and still, there are many things that are still very obscure) and is able to review and guess/feel when things could go bad, because I have used and got to be confortable with the existing stuff. I have zero experience with C++ in other contexts, and don't plan on getting more (it is not my profession) :/ My objective is to make KOReader work, and introduce/improve high level rendering stuff, that we as users will benefit from. And I'm somehow responsible for you other contributors to not break things :) so you don't break the fragile stuff I had managed to myself not break over the years :) I don't have much experience with low-level stuff, so I welcome your and @benoit-pierre recent improvements - even if I'm not always competent to review them. So I have to trust you both :) But I can somehow manage to follow what you do because you keep using the low level lString stuff I know a bit. (But, having no such experience, being able to look at the included low-level inplementation in lvstring, lvhashtable... has given me some insights about how this low level stuff work, and cost - which is not the case with std:: stuff, where I only can read and trust the doc.) There are/were also less experienced contributors (and that includes me :) that would just go on copying and pasting existing bits of code, and expect that if it works there, it will work here.

If you go with introducing std::, it will be something new to me, that I can't vouch for (even if everybody trust libstd, I don't like to introduce opaque things). And it will be more occasions for future less experienced contributors to copy and paste these, and harder to review for me, and to somehow keep crengine clean and consistent. Even if it is frustrating, I think it's easier for you experienced coders to downgrade your skills and adjust to the context, than for me and newcomers to judge what's best, and go on playing and copy and paste samples of the std:: usage from stackoverflow.com :) without being sure it's really better.

It's because they are different, and there is no equivalence in lvstring.h. And the patch is simple, it's just some mechanical changes.

I had to go read a bit about std::string_view. And it doesn't feel it would change much of what's happening in the code were you use it: we don't do any string copy, the existing :pos() stuff just do scans, it can't really do less - so I expect string_view to also have to do the same :/ It feels like just your developer preference :/

The patch could improve LoadDocument() by 1%.

Is it worth it? What it tomorrow you find out that a recent absl::string_better_faster_view (absl is another implementation I guess, from the link you posted) is 2% better ? Should we link against that other library ? And if after tomorrow you find out that glib::parseHtmlHeader() is 10% better? Should we link against that fat libglib ?

I'm not arguing for a rewrite of existing code, if it works well, but I do hope you can be more open to modern C++ stuff in new code.

(That bit you tweaked is existing code - so may be not the candidate to create this precedent :)

Working with legacy code is hard and I appreciate your effort, but please don't force me to add more technical debt to the code base.

I understand, and it is hearbreaking :), for one of us depending on the way we go :/

And what if tomorrow you move to other things, and we/I am left with having to maintain the bits of code you tweaked? Or just live with these opaque stuff - that is slower with that version of libstdc++ we have to use on Kindle, but is fast with the one we get to use on Kobo?

Anyway, code quality/technicalities is one thing. The (possibly low) quality of the people left around to keep the project alive and working is another important one :)

Anyway, I've state my point of view/state of mind, and I'm not for taking that step (for the possibly wrong reasons stated above :). But I'm not the only one on the project, so feedback/thoughts/responsabilities from @NiLuJe @Frenzie @benoit-pierre welcome :) (I really don't want you to run away - and I've read in ebooks one can go on living and reviewing even with a broken heart :)

Frenzie commented 10 months ago

I'm not sure if I'm necessarily the right person to opine on crengine from that particular angle, but from what I've seen the STL stuff can often be slow, broken, incomplete, opaque and confusing. There's a reason everybody uses Boost (or perhaps I should even say: not-C++ but e.g. Java) instead, n'est-ce pas? (Said opinion may or may not apply to C++17, and was purposefully written provocatively. In this case it looks innocent and light-weight.)

However, the knife cuts both ways. The downsides may hardly be big enough to justify doing something custom instead of just using what's there out of the box.

bbshelper commented 10 months ago

Before going on with the discussion, please be aware that I have huge respect for what you've achieved. I'm obviously not a native English speaker so it is quite challenging to express myself accurately :)

But I'm not experienced enough to feel/live it.

I'm not an C++ expert either. Haven't used it in any serious project since like 2009. That was before C++11, a major revamp of the language. STL is way easier than CREngine Template Library:tm: (CTL) You are more than competent to master it very shortly.

which is not the case with std:: stuff, where I only can read and trust the doc.)

You can trust STL, it is well documented and tested.

And it will be more occasions for future less experienced contributors to copy and paste these, and harder to review for me, and to somehow keep crengine clean and consistent.

There might be 1k times more people with STL knowledge than with CTL :) It is especially true for beginners: nowadays every C++ introductory book have STL content.

I had to go read a bit about std::string_view. And it doesn't feel it would change much of what's happening in the code were you use it: we don't do any string copy, the existing :pos() stuff just do scans, it can't really do less - so I expect string_view to also have to do the same :/

string_view doesn't allocate and copy the bytes. It's just a pointer and a size.

The patch could improve LoadDocument() by 1%.

Is it worth it?

In this particular case I believe yes. It's just some mechanical change and note the 1% is against LoadDocument(), not just CheckFormat().

What it tomorrow you find out that a recent absl::string_better_faster_view (absl is another implementation I guess, from the link you posted) is 2% better ? Should we link against that other library ? And if after tomorrow you find out that glib::parseHtmlHeader() is 10% better? Should we link against that fat libglib ?

It's hard to answer this hypothetical question :) But in general, there are many tradeoffs in programming. Decisions should be made on a case by case basis.

I'm not arguing for a rewrite of existing code, if it works well, but I do hope you can be more open to modern C++ stuff in new code.

(That bit you tweaked is existing code - so may be not the candidate to create this precedent :)

Yes and no, it's existing code but it wasn't working well (at least for performance sensitive people like me). And last year you said the same thing to my css cache patch :)

I understand, and it is hearbreaking :), for one of us depending on the way we go :/

Oh, take it easy. It's not that serious :)

And what if tomorrow you move to other things, and we/I am left with having to maintain the bits of code you tweaked?

I only write maintainable code :p

Or just live with these opaque stuff - that is slower with that version of libstdc++ we have to use on Kindle, but is fast with the one we get to use on Kobo?

Template gets instantiated at compile time. Haven't checked, but if we are using the same compiler it should exhibit the same runtime behavior.

NiLuJe commented 10 months ago

that is slower with that version of libstdc++ we have to use on Kindle

A quick correction first ;). We're no longer tributary of native (and crappy) STLs; we ship our TC's, which currently sits at GCC 11-ish at worst (or the NDK's Clang/libc++, depending on your definition of "worst" ^^). So we're good to go on being assured the STL won't blow up on us.


I definitely understand @poire-z feelings on this (and, ultimately, he's the one maintaining the code in question, and maintainability usually trumps all in such a small team as ours ;)); but I also pretty much agree with more or less all of the points @bbshelper made. I'm no fan of the STL either, but it is (generally) sound (-ish) as an all-purpose solution (Boost is... another kettle of fish, it tends to be better at doing more specific things, but it's a massive third-party dependency that should never be taken lightly). But, people generally do know it. The LV* stuff was written ages ago, and for a very specific purpose: CRe was targeting a platform with terrible standard compliance (and/or there weren't even much standards to write home about at the time ;p). On the upside, yes, you do have access to the implementation to figure out the mess. On the downside, it's utterly undocumented, unlike the STL ;).

So, yeah, ultimately, I wouldn't see the harm in introducing small (and simple and/or obvious, nothing too "modern C++" fancy) touches of STL iff it makes sense and there aren't obvious LV alternatives. Trying to re-implement "new" STL behavior into LV would be a fool's errand, though; a very bad case of NIH and a huge maintenance burden in the end ;).

Frenzie commented 10 months ago

To be clear I wasn't suggesting including Boost. :laughing:

NiLuJe commented 10 months ago

To be clear I wasn't suggesting including Boost. 😆

Oh, definitely, but it's a nice comparison to explain why it's being used in place of/in addition to the STL ;). (i.e., specific vs. all-purpose; which often explains the implementation choices in the STL, possibly making it clunkier and heavy handed with allocs).