koreader / crengine

This is the KOReader CREngine fork. It cross-pollinates with the official CoolReader repository at https://github.com/buggins/coolreader, in case you were looking for that one.

LVStream: fix and improve `LVZipDecodeStream` implementation #507

Closed · benoit-pierre closed this 1 year ago

benoit-pierre commented 1 year ago

Summary

The initial goal of this PR was to avoid unnecessary rewinds of ZIP streams due to the common pattern of first reading an initial chunk (to detect the encoding/format) before consuming/parsing the whole stream.
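For illustration, the problematic pattern looks roughly like this (a minimal sketch in the style of crengine's LVStream API; the helper functions are hypothetical):

```cpp
// Sketch of the read-then-rewind pattern; `stream` is an open stream on a
// ZIP entry, detectEncoding()/parseDocument() are hypothetical helpers.
// Before this PR, the SetPos(0) below forced LVZipDecodeStream to restart
// decompression from the very beginning of the entry.
char header[4096];
lvsize_t bytesRead = 0;
stream->Read( header, sizeof(header), &bytesRead ); // sample the head to sniff the encoding
detectEncoding( header, bytesRead );
stream->SetPos( 0 );                                // rewind before the real parse
parseDocument( stream );                            // consume the whole stream
```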

Example when loading an EPUB:

LVStreamFragment("mimetype")
LVZipDecodeStream("META-INF/container.xml")
rewind(META-INF/container.xml)
rewind(META-INF/container.xml)
rewind(META-INF/container.xml)
rewind(META-INF/container.xml)
LVZipDecodeStream("OEBPS/content.opf")
rewind(OEBPS/content.opf)
rewind(OEBPS/content.opf)
rewind(OEBPS/content.opf)
rewind(OEBPS/content.opf)
rewind(OEBPS/content.opf)
LVZipDecodeStream("OEBPS/Styles/style0001.css")
rewind(OEBPS/Styles/style0001.css)
LVZipDecodeStream("OEBPS/Styles/style0002.css")
rewind(OEBPS/Styles/style0002.css)
LVZipDecodeStream("OEBPS/Images/cover00216.jpeg")
rewind(OEBPS/Images/cover00216.jpeg)
LVZipDecodeStream("OEBPS/Text/cover_page.xhtml")
rewind(OEBPS/Text/cover_page.xhtml)
rewind(OEBPS/Text/cover_page.xhtml)
rewind(OEBPS/Text/cover_page.xhtml)
LVZipDecodeStream("OEBPS/Text/part0000.xhtml")
rewind(OEBPS/Text/part0000.xhtml)
rewind(OEBPS/Text/part0000.xhtml)
rewind(OEBPS/Text/part0000.xhtml)
LVZipDecodeStream("OEBPS/Styles/style0001.css")
rewind(OEBPS/Styles/style0001.css)
LVZipDecodeStream("OEBPS/Styles/style0002.css")
rewind(OEBPS/Styles/style0002.css)
LVZipDecodeStream("OEBPS/Text/part0001.xhtml")
rewind(OEBPS/Text/part0001.xhtml)
rewind(OEBPS/Text/part0001.xhtml)
rewind(OEBPS/Text/part0001.xhtml)
LVZipDecodeStream("OEBPS/Text/part0002.xhtml")
rewind(OEBPS/Text/part0002.xhtml)
rewind(OEBPS/Text/part0002.xhtml)
rewind(OEBPS/Text/part0002.xhtml)
LVZipDecodeStream("OEBPS/Text/part0003.xhtml")
rewind(OEBPS/Text/part0003.xhtml)
rewind(OEBPS/Text/part0003.xhtml)
rewind(OEBPS/Text/part0003.xhtml)
LVZipDecodeStream("OEBPS/Text/part0004.xhtml")
rewind(OEBPS/Text/part0004.xhtml)
rewind(OEBPS/Text/part0004.xhtml)
rewind(OEBPS/Text/part0004.xhtml)
LVZipDecodeStream("OEBPS/Text/part0005.xhtml")
rewind(OEBPS/Text/part0005.xhtml)
rewind(OEBPS/Text/part0005.xhtml)
rewind(OEBPS/Text/part0005.xhtml)
LVZipDecodeStream("OEBPS/Text/part0006.xhtml")
rewind(OEBPS/Text/part0006.xhtml)
rewind(OEBPS/Text/part0006.xhtml)
rewind(OEBPS/Text/part0006.xhtml)
rewind(OEBPS/Text/part0006.xhtml)
[…]
LVZipDecodeStream("OEBPS/Text/part0029.xhtml")
rewind(OEBPS/Text/part0029.xhtml)
rewind(OEBPS/Text/part0029.xhtml)
rewind(OEBPS/Text/part0029.xhtml)
LVZipDecodeStream("OEBPS/Text/part0030.xhtml")
rewind(OEBPS/Text/part0030.xhtml)
rewind(OEBPS/Text/part0030.xhtml)
rewind(OEBPS/Text/part0030.xhtml)
rewind(OEBPS/Text/part0030.xhtml)
LVZipDecodeStream("OEBPS/Text/part0031.xhtml")
rewind(OEBPS/Text/part0031.xhtml)
rewind(OEBPS/Text/part0031.xhtml)
rewind(OEBPS/Text/part0031.xhtml)
LVZipDecodeStream("OEBPS/toc.ncx")
rewind(OEBPS/toc.ncx)
rewind(OEBPS/toc.ncx)
rewind(OEBPS/toc.ncx)
rewind(OEBPS/toc.ncx)

Between 1 and 5 extra rewinds!

The code is now smarter: it keeps track of the bounds of the currently decompressed chunk, so seeking within those bounds does not trigger a rewind.
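The idea, as a minimal sketch (not the actual patch; the signature and member names are approximations):

```cpp
// m_chunkStart/m_chunkEnd delimit the slice of decompressed data currently
// held in the buffer; m_pos is the logical (uncompressed) position.
lverror_t LVZipDecodeStream::SetPos( lvpos_t pos )
{
    if ( pos >= m_chunkStart && pos <= m_chunkEnd ) {
        m_pos = pos;       // inside the decoded chunk: no rewind needed
        return LVERR_OK;
    }
    if ( pos < m_chunkStart ) {
        // Deflate streams can't seek backward: restart decompression
        // from the beginning of the entry (the costly "rewind").
        rewindToStart();   // hypothetical helper
    }
    // Decompress forward, discarding output, until pos is reached.
    return skipTo( pos );  // hypothetical helper
}
```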

Loading the same EPUB:

LVStreamFragment("mimetype")
LVZipDecodeStream("META-INF/container.xml")
LVZipDecodeStream("OEBPS/content.opf")
LVZipDecodeStream("OEBPS/Styles/style0001.css")
rewind(OEBPS/Styles/style0001.css)
LVZipDecodeStream("OEBPS/Styles/style0002.css")
LVZipDecodeStream("OEBPS/Images/cover00216.jpeg")
LVZipDecodeStream("OEBPS/Text/cover_page.xhtml")
LVZipDecodeStream("OEBPS/Text/part0000.xhtml")
LVZipDecodeStream("OEBPS/Styles/style0001.css")
rewind(OEBPS/Styles/style0001.css)
LVZipDecodeStream("OEBPS/Styles/style0002.css")
LVZipDecodeStream("OEBPS/Text/part0001.xhtml")
LVZipDecodeStream("OEBPS/Text/part0002.xhtml")
LVZipDecodeStream("OEBPS/Text/part0003.xhtml")
LVZipDecodeStream("OEBPS/Text/part0004.xhtml")
LVZipDecodeStream("OEBPS/Text/part0005.xhtml")
LVZipDecodeStream("OEBPS/Text/part0006.xhtml")
rewind(OEBPS/Text/part0006.xhtml)
[…]
LVZipDecodeStream("OEBPS/Text/part0029.xhtml")
LVZipDecodeStream("OEBPS/Text/part0030.xhtml")
rewind(OEBPS/Text/part0030.xhtml)
LVZipDecodeStream("OEBPS/Text/part0031.xhtml")
LVZipDecodeStream("OEBPS/toc.ncx")

There's still an extra rewind when the encoding detection code is involved (see #497).

In the process of doing that, I also fixed the CRC handling.

Performance impact

Overall, loading is slightly faster. I tested on 2 different configurations:

Loading times:

Rendering times:

For reference, if that last extra rewind was taken care of (by reducing the encoding detection buffer size to 16K), the resulting loading times become:

Analysis of slower loading times

Because the CRC is now correctly calculated on the decompressed data, entries with a high compression ratio can be slower to read.
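For context, this means running every inflated byte through zlib's crc32(); a minimal sketch of that extra work (the decompression step is a hypothetical helper):

```cpp
#include <zlib.h>

// Fold every inflated chunk into a running CRC-32, then compare it to the
// value recorded in the ZIP metadata once the entry is fully decompressed.
// The crc32() update is the extra per-byte cost paid on highly compressed
// entries (lots of output bytes for few input bytes).
Bytef chunk[16384];
uInt chunkLen = 0;
uLong crc = crc32( 0L, Z_NULL, 0 );             // zlib's canonical initial value
while ( inflateNextChunk( chunk, &chunkLen ) )  // hypothetical decompression step
    crc = crc32( crc, chunk, chunkLen );
bool crcOk = ( crc == entryCrc );               // entryCrc: value from the ZIP metadata
```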

The 2 slowest loading times (+10.5% and +7.8%) on Kindle are examples of inefficiencies in the font manager: duplicated calls to uncompress the same 4 fonts (4*81=324 and 4*27=108 times respectively). With an average compression ratio of ~3.6, calculating the CRC has a noticeable impact on performance.

Possible improvements



poire-z commented 1 year ago

All this ZIP/CRC business is way over my head :) so I won't really be able to review, and will trust you on all this. Pinging @bbshelper (as you had your eyes on some of this previously) for more thoughts.

> the case where a zip local file header CRC is not available was not handled (similarly to how the packed / unpacked sizes can be missing): I fixed and simplified `LVZipDecodeStream::Create` to always use the information passed by the archive container (which normally comes from the central directory).

I remember I added some stuff to fall back to (or from, dunno the order...) local to central stuff when one or the other was absent/bad: #252. No change related to this handling, it will still do the best thing?

> • the code to handle images sometimes decodes one twice (`!assume_valid && !img->Decode( NULL )` in `LVCreateStreamImageSource`).

I think I added that recently (and I think it was even worse before in the number of decodings), and there's a bit of abstraction/proxy/round-tripping that made some re-decoding quite hard to avoid.

> IMHO, the benefit of checking the CRC is debatable: because the final check is only done once the full stream has been decompressed and already consumed, it offers no real protection against corrupted content.

Dunno. Is this CRC stuff mostly about the internal zip decoding business, or does it end up being used as a hash to identify the document, i.e. to name the cr3cache file? If it's independent, well, I guess I wouldn't care: either an epub loads today and tomorrow, or it doesn't :) We can trust SD cards to not corrupt files.

(In hacking/optimizing crengine, I never thought the ZIP decoding part would take that much time; I thought unzipping was such common and quick stuff, vs. everything else crengine has to do with its DOM and rendering stuff...)

benoit-pierre commented 1 year ago

>> the case where a zip local file header CRC is not available was not handled (similarly to how the packed / unpacked sizes can be missing): I fixed and simplified `LVZipDecodeStream::Create` to always use the information passed by the archive container (which normally comes from the central directory).

> I remember I added some stuff to fall back to (or from, dunno the order...) local to central stuff when one or the other was absent/bad: #252. No change related to this handling, it will still do the best thing?

Yes: fallback from central directory to local header information.

>> IMHO, the benefit of checking the CRC is debatable: because the final check is only done once the full stream has been decompressed and already consumed, it offers no real protection against corrupted content.

> Dunno. Is this CRC stuff mostly about the internal zip decoding business, or does it end up being used as a hash to identify the document, i.e. to name the cr3cache file? If it's independent, well, I guess I wouldn't care: either an epub loads today and tomorrow, or it doesn't :) We can trust SD cards to not corrupt files.

The CRC from the ZIP metadata is used anyway, not the calculated one. So I'm really not sure it's worth calculating. After corrupting an EPUB HTML entry for testing, there does not seem to be any impact besides the possible error trace…

Additionally, the CRC is still not calculated if the entry is "stored" (no compression: different code path).
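For reference, ZIP entries carry a compression method field (0 = stored, 8 = deflate), and only the deflate path goes through the inflater where the running CRC gets updated. A hypothetical sketch of the dispatch (class names are illustrative, not crengine's):

```cpp
// Stored entries are served as a raw byte window on the container,
// bypassing the inflate loop entirely, so the CRC computed during
// inflation simply never happens for them.
switch ( entry->GetCompressionMethod() ) {       // hypothetical accessor
case 0: // "stored", per the ZIP spec: data kept uncompressed
    return new LVRawEntryStream( container, entry );  // hypothetical class, no CRC pass
case 8: // "deflate": decoded by the inflater, where the CRC is updated
    return LVZipDecodeStream::Create( container, entry ); // arguments simplified
default:
    return NULL; // unsupported compression method
}
```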

> (In hacking/optimizing crengine, I never thought the ZIP decoding part would take that much time; I thought unzipping was such common and quick stuff, vs. everything else crengine has to do with its DOM and rendering stuff...)

Up to 5 unnecessary rewinds… And the remaining one (because of the big 128K encoding detection buffer size) results in fully decompressing a lot of text entries 2 times!

benoit-pierre commented 1 year ago

It seems to depend on the entry: if I simulate a CRC error on all entries, including [Content_Types].xml for an FB3 file, I get this error:

Failed recognizing or parsing this file: unsupported or invalid document.\nKOReader will exit now.

And KOReader does indeed quit… :P

NiLuJe commented 1 year ago

Like @poire-z, I only have a very vague and theoretical handle on ZIP stuff (I usually let libarchive deal with it myself ;p), but nothing jumps out at a quick glance, and I certainly don't mind dropping the Zip64 stuff (which is a right mess to handle properly anyway), as it doesn't really fit our intended use-cases anyway.

poire-z commented 1 year ago

> I certainly don't mind dropping the Zip64 stuff (which is a right mess to handle properly anyway), as it doesn't really fit our intended use-cases anyway.

I wouldn't really mind either, but that wasn't announced in the PR or commit messages :/ What's the situation? There are only 2 `#if LVLONG_FILE_SUPPORT...` removed, so possibly some stuff was factorized, and there are still 9 other ones in lvstream.cpp, so not everything disappeared :) and maybe Zip64 still works?

benoit-pierre commented 1 year ago

It's not dropped, but instead of parsing the local file header (which might be missing sizes/CRC info), it's better to rely on the info passed as arguments, which comes from parsing the central directory file header. The latter always has valid sizes/CRC fields.

And in the case where that central directory is corrupted, the code already falls back to trying to find/parse the local file headers.
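For background: per the ZIP format, central directory records (signature 0x02014b50) always carry valid CRC/size fields, while a local file header (signature 0x04034b50) written with general-purpose flag bit 3 set has them zeroed, the real values only appearing in a data descriptor after the entry data. A rough sketch of the resulting preference order (struct and helpers are illustrative, not the PR's code):

```cpp
// Illustrative only: trust the central directory first, and fall back to
// the local file header only when the central directory is unreadable.
struct ZipEntryInfo {
    lUInt32 crc32;        // CRC-32 of the uncompressed data
    lUInt32 packedSize;   // compressed size
    lUInt32 unpackedSize; // uncompressed size
};

bool getEntryInfo( ZipEntryInfo &info )
{
    if ( readCentralDirectoryRecord( &info ) ) // hypothetical: parse 0x02014b50 record
        return true;                           // CRC/sizes here are always valid
    // Central directory corrupt or missing: try the local file header,
    // keeping in mind its CRC/size fields may be zero (flag bit 3).
    return readLocalFileHeader( &info );       // hypothetical fallback parser
}
```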

benoit-pierre commented 1 year ago

I think the remaining question is: "to CRC or not to CRC?".

In some cases, this PR will change KOReader's behavior if a ZIP is partially corrupted.

IMHO, the best behavior, both from a user and developer standpoint, would be to still load corrupted files (as much as possible), but there should be some warnings to the user, particularly so developers don't get bug reports about weird issues that might arise from such content.

Certainly, the current behavior on master is not ideal (what CRC error? ;)).

poire-z commented 1 year ago

> the best behavior, both from a user and developer standpoint, would be to still load corrupted files (as much as possible),

I agree with that. (As long as the corruption stays consistent :) and the XML we managed to parse, and the DOM we built, are reproducible - so xpath/highlights made on that corrupted file are stable.)

benoit-pierre commented 1 year ago

OK, so how about I disable the CRC calculation/check for this PR?

This will leave the behavior in case of corrupted files unchanged for now.

poire-z commented 1 year ago

Fine with me, I guess. Pinging @Frenzie @NiLuJe for more thoughts. If it is easy, could the CRC (or old bad stuff) be kept and just not used, depending on a `#define`?

benoit-pierre commented 1 year ago

It's just 5 lines of code I'll `#if 0` for now.
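Something along these lines, i.e. the check stays in the tree but is compiled out (member names are hypothetical, not the literal diff):

```cpp
#if 0 // CRC check disabled for now: keeps master's behavior on corrupted entries
    if ( m_calculatedCRC != m_expectedCRC )
        return LVERR_FAIL;
#endif
```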

NiLuJe commented 1 year ago

Sounds good to me ;).

Frenzie commented 1 year ago

I agree that (partially) loading corrupted files is generally better, but I don't really know if or how that might differ from the current behavior.

poire-z commented 1 year ago

> It's just 5 lines of code I'll `#if 0` for now.

Will you do that in this PR? Or later? Can we merge your 2 PRs, or wait a bit?

benoit-pierre commented 1 year ago

> I agree that (partially) loading corrupted files is generally better, but I don't really know if or how that might differ from the current behavior.

The current behavior is to attempt loading corrupted files since CRCs are effectively not checked.

>> It's just 5 lines of code I'll `#if 0` for now.

> Will you do that in this PR? Or later? Can we merge your 2 PRs, or wait a bit?

This PR, done.