All this ZIP/CRC business is way over my head :) so I won't really be able to review, and will trust you on all this. Pinging @bbshelper (as you had your eyes on some of this previously), for more thoughts.
> the case where a zip local file header CRC is not available was not handled (similarly to how the packed / unpacked sizes can be missing): I fixed and simplified `LVZipDecodeStream::Create` to always use the information passed by the archive container (which normally comes from the central directory).
I remember I added some stuff to fall back to (or from, dunno the order...) the local vs. central stuff when one or the other was absent/bad: #252. No change related to this handling, it will still do the best thing?
> - the code to handle images sometimes decodes one 2 times (`!assume_valid && !img->Decode( NULL )` in `LVCreateStreamImageSource`).
I think I added that recently (and I think it was even worse before in the number of decodings), and there's a bit of abstraction/proxy/back-and-forth that made some re-decoding quite hard to avoid.
> IMHO, the benefit of checking the CRC is debatable: because the final check is only done once the full stream has been decompressed and already consumed, it offers no real protection against corrupted content.
Dunno. Is this CRC stuff mostly about the internal zip decoding business, or does it end up being used as a hash to identify the document, i.e. to name the cr3cache file? If it's independent, well, I guess I wouldn't care: either an epub loads today and tomorrow, or it doesn't :) We can trust SD cards to not corrupt files.
(In hacking/optimizing crengine, I never thought the ZIP decoding part would take that much time - I thought unzipping was such common and quick stuff, vs. everything else crengine has to do with its DOM and rendering stuff...)
> > the case where a zip local file header CRC is not available was not handled (similarly to how the packed / unpacked sizes can be missing): I fixed and simplified `LVZipDecodeStream::Create` to always use the information passed by the archive container (which normally comes from the central directory).
>
> I remember I added some stuff to fall back to (or from, dunno the order...) the local vs. central stuff when one or the other was absent/bad: #252. No change related to this handling, it will still do the best thing?
Yes: fallback from central directory to local header information.
> > IMHO, the benefit of checking the CRC is debatable: because the final check is only done once the full stream has been decompressed and already consumed, it offers no real protection against corrupted content.
>
> Dunno. Is this CRC stuff mostly about the internal zip decoding business, or does it end up being used as a hash to identify the document, i.e. to name the cr3cache file? If it's independent, well, I guess I wouldn't care: either an epub loads today and tomorrow, or it doesn't :) We can trust SD cards to not corrupt files.
The CRC from the ZIP metadata is used anyway, not the calculated one. So I'm really not sure it's worth calculating. After corrupting an EPUB HTML entry for testing, there does not seem to be any impact besides the possible error trace…
Additionally, the CRC is still not calculated if the entry is "stored" (no compression: different code path).
> (In hacking/optimizing crengine, I never thought the ZIP decoding part would take that much time - I thought unzipping was such common and quick stuff, vs. everything else crengine has to do with its DOM and rendering stuff...)
Up to 5 unnecessary rewinds… And the remaining one (because of the big 128K encoding detection buffer size) results in fully decompressing a lot of text entries 2 times!
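For readers not familiar with that pattern, here is a minimal, hedged sketch of the access pattern behind that remaining rewind (the `Stream` interface and function names are made up for this example, not crengine's API): the parser reads a large probe buffer for charset detection, then seeks back to 0 and re-reads, which for a deflated ZIP entry means inflating the same data twice unless the first pass is cached.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative stream interface (not crengine's LVStream API).
struct Stream {
    virtual size_t read(void* buf, size_t len) = 0;
    virtual void seek(uint64_t pos) = 0;
    virtual ~Stream() = default;
};

// Read a big probe for charset detection, then rewind and parse from the
// start: for a deflated ZIP entry smaller than the probe, the whole entry is
// decompressed once for the probe and again after the seek, unless the first
// pass is cached.
void detectThenParse(Stream& s) {
    std::vector<char> probe(128 * 1024);   // the 128K detection buffer mentioned above
    size_t got = s.read(probe.data(), probe.size());
    (void)got;                             // ... run encoding detection on the probe ...
    s.seek(0);                             // rewind: forces re-decompression from byte 0
    // ... parse the full entry from the beginning ...
}
```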
It seems to depend on the entry: if I simulate a CRC error on all entries, including `[Content_Types].xml` for an FB3 file, I get this error:

> Failed recognizing or parsing this file: unsupported or invalid document.\nKOReader will exit now.

And KOReader does indeed quit… :P
Like @poire-z, I only have a very vague and theoretical handle on ZIP stuff (I usually let libarchive deal with it myself ;p), but nothing jumps out at a quick glance, and I certainly don't mind dropping the Zip64 stuff (which is a right mess to handle properly anyway), as it doesn't really fit our intended use-cases anyway.
> I certainly don't mind dropping the Zip64 stuff (which is a right mess to handle properly anyway), as it doesn't really fit our intended use-cases anyway.
I wouldn't really mind either, but that wasn't announced in the PR or commit messages :/
What's the situation? There are only 2 `#if LVLONG_FILE_SUPPORT` removed, so possibly some stuff got factorized, and there are still 9 other ones in lvstream.cpp, so not everything disappeared :) and maybe Zip64 still works?
It's not dropped, but instead of parsing the local file header (which might be missing sizes/CRC info), it's better to rely on the info passed as arguments, which comes from parsing the central directory file header. The latter always has valid sizes/CRC fields.
And in the case where that central directory is corrupted, the code already falls back to trying to find/parse the local file headers.
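For illustration, a minimal sketch of why the central directory is the preferred source and the local header only a fallback (the structure and function names here are made up for this example, not crengine's):

```cpp
#include <cstdint>

// Illustrative only: the sizes/CRC an entry reader needs.
struct ZipEntryInfo {
    uint32_t crc32;              // CRC-32 of the uncompressed data
    uint32_t compressedSize;
    uint32_t uncompressedSize;
};

// When general purpose bit 3 (0x0008) is set, the local file header stores
// zeros for CRC/sizes and the real values only follow the data in a data
// descriptor, so the local header alone may carry no usable information.
static bool localHeaderHasSizes(uint16_t generalPurposeFlags) {
    return (generalPurposeFlags & 0x0008) == 0;
}

// The central directory record (written after all the data) always carries
// the final values, so it is the preferred source, with the local header kept
// only as a fallback when the central directory itself is damaged.
ZipEntryInfo pickEntryInfo(const ZipEntryInfo& fromCentralDirectory,
                           const ZipEntryInfo& fromLocalHeader,
                           uint16_t generalPurposeFlags,
                           bool centralDirectoryValid) {
    if (centralDirectoryValid)
        return fromCentralDirectory;
    if (localHeaderHasSizes(generalPurposeFlags))
        return fromLocalHeader;
    return ZipEntryInfo{0, 0, 0};  // unknown: caller must stream until end of data
}
```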
I think the remaining question is: "to CRC or not to CRC?".
In some cases, this PR will change KOReader's behavior if a ZIP is partially corrupted.
IMHO, the best behavior, both from a user and developer standpoint, would be to still load corrupted files (as much as possible), but there should be some warning to the user, particularly so developers don't get bug reports about weird issues that might arise from such content.
Certainly, the current behavior on master is not ideal (what CRC error? ;)).
> the best behavior, both from a user and developer standpoint, would be to still load corrupted files (as much as possible),
I agree with that. (As long as the corruption stays consistent :) and the XML we managed to parse, and the DOM we built, are reproducible - so xpath/highlights made on that corrupted file are stable.)
OK, so how about I disable the CRC calculation/check for this PR?
This will leave the behavior in case of corrupted files unchanged for now.
Fine with me I guess. Pinging @Frenzie @NiLuJe for more thoughts.
If it is easy, could the CRC (or old bad stuff) be kept and just not used, depending on a `#define`?
It's just 5 lines of code I'll `#if 0` for now.
Sounds good to me ;).
I agree that (partially) loading corrupted files is generally better, but I don't really know if or how that might differ from the current behavior.
> It's just 5 lines of code I'll `#if 0` for now.
Will you do that in this PR? Or later? Can we merge your 2 PRs, or wait a bit?
> I agree that (partially) loading corrupted files is generally better, but I don't really know if or how that might differ from the current behavior.
The current behavior is to attempt loading corrupted files since CRCs are effectively not checked.
> > It's just 5 lines of code I'll `#if 0` for now.
>
> Will you do that in this PR? Or later? Can we merge your 2 PRs, or wait a bit?
This PR, done.
Summary
The initial goal of this PR was to avoid unnecessary rewinds of ZIP streams due to the common pattern of first reading an initial chunk (to detect the encoding/format) before consuming/parsing the whole stream.
Example when loading an EPUB:
Between 1 and 5 extra rewinds!
The code is now smarter, and will keep track of the bounds of the currently uncompressed chunk so seeking within those bounds does not trigger a rewind.
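As a rough illustration of that idea (a hedged sketch with made-up names, not the actual `LVZipDecodeStream` code), the stream remembers the uncompressed offset range of the chunk it currently has buffered and only rewinds/re-inflates when a read falls outside it:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

class InflatedWindow {
    std::vector<uint8_t> m_buf;   // currently decompressed chunk
    uint64_t m_start = 0;         // uncompressed offset of m_buf[0]
public:
    // True when [pos, pos+len) lies entirely inside the buffered span.
    bool contains(uint64_t pos, uint64_t len) const {
        return pos >= m_start && pos + len <= m_start + m_buf.size();
    }
    // Serve a read from the buffered window when possible; otherwise the
    // caller has to rewind the deflate stream and decompress again.
    bool tryRead(uint64_t pos, void* out, uint64_t len) const {
        if (!contains(pos, len))
            return false;  // outside the window: a real rewind is needed
        std::memcpy(out, m_buf.data() + (pos - m_start), static_cast<size_t>(len));
        return true;
    }
    // Replace the window after decompressing a new chunk starting at `start`.
    void reset(uint64_t start, std::vector<uint8_t> data) {
        m_start = start;
        m_buf = std::move(data);
    }
};
```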
Loading the same EPUB:
There's still an extra rewind when the encoding detection code is involved (see #497).
In the process of doing that I also fixed the CRC handling: the case where a zip local file header CRC is not available was not handled (similarly to how the packed / unpacked sizes can be missing): I fixed and simplified `LVZipDecodeStream::Create` to always use the information passed by the archive container (which normally comes from the central directory).

Performance impact
Overall, loading is slightly faster. I tested on 2 different configurations (with `-ffast-math` disabled for LuaJIT, or the build fails on Arch Linux…).

Loading times:
Rendering times:
For reference, if that last extra rewind was taken care of (by reducing the encoding detection buffer size to 16K), the resulting loading times become:
Analysis of slower loading times
Because the CRC is now correctly calculated on the decompressed data, entries with a high compression ratio can be slower to read.
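To make the cost concrete: checking means feeding every inflated byte through a CRC-32 routine, so the work scales with the uncompressed size. A hedged sketch using zlib's `crc32()` (the helper name is made up for this example; this is not the crengine code):

```cpp
#include <zlib.h>
#include <cstddef>
#include <cstdint>

// Running CRC-32 over decompressed output: start from crc32(0L, Z_NULL, 0)
// and update once per inflate() output chunk. With a ~3.6 compression ratio,
// every compressed byte read costs roughly 3.6 bytes of CRC work.
uint32_t updateEntryCrc(uint32_t runningCrc, const unsigned char* chunk, size_t len) {
    return static_cast<uint32_t>(crc32(runningCrc, chunk, static_cast<uInt>(len)));
}
```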
The 2 slowest loading times (+10.5% and +7.8%) on Kindle are examples of inefficiencies in the font manager: duplicated calls to uncompress the same 4 fonts (4*81=324 and 4*27=108 times respectively). With an average compression ratio of ~3.6, calculating the CRC has a noticeable impact on performance.

Possible improvements
- the code to handle images sometimes decodes one 2 times (`!assume_valid && !img->Decode( NULL )` in `LVCreateStreamImageSource`).