laktak / chkbit-py

Check your files for data corruption
MIT License
96 stars 6 forks source link

slower verify in version 3 #7

Closed spock closed 6 months ago

spock commented 6 months ago

Previously, running --verify-index on my 50k files - without any existing index - was taking 1m 50s with 1 worker, and 35s with 6 workers. It was simply listing all the files, and that was very fast.

Now, with version 3, verification takes about 24m with a single worker, and about 6m 40s with 6 workers; which are about the same times as for updating the index.

Looking at line 39 here https://github.com/laktak/chkbit-py/commit/a03d5b4267b1e8de23b499ea317960bd907424f1#diff-0c955c1fb53488dd1d4f7f30bec88d9339a66af65b566f0fa1375d7dc926b22bR39 , it looks like even in the read-only verification mode hashes are calculated and modified times are checked for all new files.

I think in the previous version new files during verification were simply reported: https://github.com/laktak/chkbit-py/commit/a03d5b4267b1e8de23b499ea317960bd907424f1#diff-5b1e4f1fcc838580d1fd73bbf4d2b6a8ad87d3a526fd32c45d9743ddd84ae7aaL43 (doesn't expand from the URL - indexthread.py, line 43)

spock commented 6 months ago

I can try putting

        # calc the new hashes
        index.update()

        # compare
        index.check_fix(self.context.force)

under

        # save if update is set
        if self.update:

but I need advice on how to install locally, pip install --user . from the repository folder no longer works for me (it creates chkbit and cli as two separate modules and then imports fail).

spock commented 6 months ago

I was able to move cli under chkbit (using relative imports) and thus install from the repo. Let me know if I should make a pull request out of that import reshuffling. It mostly uses from . import ..., and in one file in cli it uses from .. import ....

Applying the "fix" from my comment above results in no new files being reported at all, but yes it becomes exactly as fast as it was before (32s with 6 workers or 1m 52s with 1 worker).

I now wonder if previous version also wasn't reporting new files, if there were no .chkbit index files?

spock commented 6 months ago

I've added update argument to index.update(), so now it can skip checking mtime and calculating hashes for new files. And made a few more small changes. Now with 6 workers I get expected output, very fast:

Processed 48409 files in readonly mode.
No changes were made (specify -u to update):
- 48409 files would have been added and
- 0 files would have been updated.

real    0m33.298s

I think I have it now, but haven't yet figured out if I need to somehow fake context.hit without calculating a hash.
For now I'm just increasing cfiles count.

spock commented 6 months ago

Let me know if I should make this into a PR; it is independent of the "imports reshuffling" commit, and is rather simple.

For now I've been editing and committing directly in a cloned repository, without even making a new branch 🤦🏻‍♂️
I hope there is still an easy way to somehow cherry-pick these two commits as PRs to your repository.

laktak commented 6 months ago

Sorry for discovering this bug and thank you for looking into it ;) I wanted to simplify the cli options and my test cases did not include this situation.

I'm curious about the use case though. Are you running it on a backup medium that's not fully indexed?

About installing and running chkbit from source - I'll add it to the readme.

spock commented 6 months ago

My use-case is a "live archive", a photo collection where I'm sometimes reorganizing, editing, and adding new photos (normally in batches of dozens to hundreds, not more).

Under normal usage conditions "verifying unindexed location" shouldn't be an issue, of course; but it's also very nice if software is as fast as possible - when possible :)

I'm planning to run chkbit just before auto-copying this collection to a secondary encrypted copy.
As one option, I'm considering to check chkbit exit status to decide if backing up can proceed or should be halted for investigation.

laktak commented 6 months ago

OK, in any case, thanks for discovering and fixing it! I pushed the new version to pypi.