Optimize checksum calculation on verify

aryanA101a commented 4 months ago

Related to #614

I modified the FileImpl::Verify checksum algo to calculate the checksum in pieces of 1024 bytes to save on cpu cycles and IO requests.
Around 2.6x fast (tested on 16 core AMD EPYC 7B13 Gitpod)

Size	Previous	Optimized
31 MB	0.73s	0.27s
4 GB	96.84s	38.17s

kelson42 commented 4 months ago

@aryanA101a Thank you for your promising PR. I really appreciate this PR if confirmed to be done the right way, but probably too short to close the ticket.

aryanA101a commented 4 months ago

Thanks for the PR.

This is a tricky part here. I wonder if we don't have to change the logic by having only one loop, reading as long as we have data to read (instead of a while loop inside a for loop). Something like:

auto remaining = checkSumPos;
auto part_iter = zimFile->begin();
std::ifstream current_stream(part->second->filename(), ...);
while (remaining) {
    stream.read(reinterpret_cast<char*>(ch), min(piece_size, remaining));
    zim_MD5Update(&md5ctx, ch, stream.gcount());
    remaining -= stream.gcount();

    if (remaining == 0) {
        // we have read everything
        break;
    }
    if (!stream.good()) {
        part_iter++;
        current_stream = std::ifstream(part->second->filename(), ...);
    }
}

From the beginning, there were two loops(nested for loops). Because according to FileCompound API, we can access the zimfile in parts, and we have to read from each part.

class FileCompound : private std::map<Range, FilePart*, less_range> {
    typedef std::map<Range, FilePart*, less_range> ImplType;

  public: // types
    typedef const_iterator PartIterator;
    typedef std::pair<PartIterator, PartIterator> PartRange;

  public: // functions
    explicit FileCompound(const std::string& filename);
...

(This is the global idea, error checking and stuff have to be added)

Have you tried with a chunk bigger than 1024 ? How it behaves ?

Yes, I have tried more chunk sizes. On a `4gb` zim file.	Chunk Size	Time
128	37.77s
1024	36.89s
2048	36.63s

I think 1024 is the sweet spot here.

mgautierfr commented 4 months ago

From the beginning, there were two loops(nested for loops). Because according to FileCompound API, we can access the zimfile in parts, and we have to read from each part.

Yes, but as we refactor, we may remove the nested loops. My solution also loop on the parts, it "simply" not using a while or for.

Anyway, you last proposition is mostly good, so let's continue with it.

codecov[bot] commented 4 months ago

Codecov Report

Attention: Patch coverage is 30.00000% with 7 lines in your changes are missing coverage. Please review.

Project coverage is 58.00%. Comparing base (8d978e1) to head (5e5c0b8).

Files	Patch %	Lines
src/fileimpl.cpp	30.00%	0 Missing and 7 partials :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #861 +/- ## ========================================== - Coverage 58.04% 58.00% -0.05% ========================================== Files 101 101 Lines 4617 4622 +5 Branches 1921 1925 +4 ========================================== + Hits 2680 2681 +1 Misses 667 667 - Partials 1270 1274 +4 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

openzim / libzim

Optimize checksum calculation on verify #861

Codecov Report