laktak / chkbit

Check your files for data corruption
MIT License

Secure hash functions #3

Closed lgarron closed 2 years ago

lgarron commented 3 years ago

I noticed from the documentation that this project uses MD5 hashes. While I understand that this works well enough to catch unintentional corruption, I'm wondering if it would be possible to support secure hash functions going forward.

I understand that it's not a primary goal to protect against adversarial attacks, but I can also imagine some situations (e.g. content-addressed storage) where a program might accidentally swap a file with another that has the same hash. In general, it would be safest if there were no known way to create files with colliding hashes. My understanding is that these hashes would not incur a significant performance penalty on modern computers.

laktak commented 3 years ago

I'm not sure about that use case. In the case of an attack, the index file itself is not secured and could simply be updated with the new hash. This is actually a feature: if you make intentional modifications to a file (which updates its modified date), you don't want false error messages for those edits.

Accidental identical hashes do not seem likely to me, as in addition to the hash there are more bits (the file size/date) that must match as well.

Joshfindit commented 2 years ago

Even if there is no risk of malicious collisions, MD5 has been shown to have natural collisions "in the wild", and that makes it feel risky for verifying large backups.

In my own tools, I originally followed git's choice of SHA-1 hashes because of SHA-1's speed, and have since converted to SHA-256 now that SHA-1 has been compromised. The other thing I do is append the file size in bytes, so the identifier is <hash>.<length>. This makes accidental (or even malicious) collisions practically impossible. I can't remember where I picked that up, but it has brought tremendous peace of mind.
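
In Python the scheme looks roughly like this (a sketch only; the function name and chunk size are illustrative, not from chkbit or my tools):

```python
import hashlib

def content_id(path):
    """Return a "<hash>.<length>" identifier for a file (illustrative scheme)."""
    h = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # read in 1 MiB chunks so large backup files don't need to fit in memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
            size += len(chunk)
    return "{}.{}".format(h.hexdigest(), size)
```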

laktak commented 2 years ago

hashlib offers sha1(), sha224(), sha256(), sha384(), and sha512() (for Python 3.6+).

So I think that could be implemented without any issues. Let me know if you are working on a PR.
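
Something along these lines should work (just a sketch of hashlib usage; chkbit's actual code may differ):

```python
import hashlib

def hash_file(path, algo="md5"):
    """Hash a file with any algorithm hashlib knows (md5, sha1, sha256, sha512, ...)."""
    h = hashlib.new(algo)  # hashlib.new() accepts the algorithm name as a string
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. hash_file("backup.img", "sha512")
```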

laktak commented 2 years ago

I've added an --algo sha512 switch. @lgarron @Joshfindit can you do a test?

laktak commented 2 years ago

Published as 2.2.0.

coelner commented 2 years ago

Using cryptographic hashes is overkill for this case; better to use a generic (non-cryptographic) hash like xxHash: https://github.com/Cyan4973/xxHash

It is used in btrfs for integrity checks, so it may be worth considering here.

I suppose a global export file could contain SHA-based values, generated on the first run (or for newly added files) together with an xxHash. Each scrub over the files would then only check the xxHash, because it is fast. The only persistent way to preserve integrity data across filesystem boundaries is to attach the hash to the filename, similar to how git shows only the first few bytes of a commit hash.
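
As a rough sketch of that two-tier idea (using the third-party xxhash Python package; the names are illustrative, nothing here is chkbit code):

```python
import hashlib
import xxhash  # third-party: pip install xxhash

CHUNK = 1 << 20  # 1 MiB

def first_run_digests(path):
    """On the first run (or for newly added files) record both sha256 and xxh64."""
    sha, xxh = hashlib.sha256(), xxhash.xxh64()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            sha.update(chunk)
            xxh.update(chunk)
    return sha.hexdigest(), xxh.hexdigest()

def scrub_ok(path, expected_xxh64):
    """Later scrubs only recompute the cheap xxh64 digest."""
    xxh = xxhash.xxh64()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            xxh.update(chunk)
    return xxh.hexdigest() == expected_xxh64
```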

Bit rot means only a few bits flip, so the collision probability is tiny. To protect against malicious attacks you need something like dm-verity/fs-verity.

MD5/SHA-1 are vulnerable to collision attacks, but that should not be a problem for this case. Just put the backup into an encrypted container.

Off-topic: for restores, use an integrity-aware filesystem like btrfs, zfs, or lvm2 together with RAID (block-device based), or snapraid for the filesystem-based approach.

laktak commented 2 years ago

I think md5 is good for this case and it is used by default. But supporting other methods is not a big deal and everybody has different needs.

xxhash sounds interesting and if it is significantly faster than md5 I might take a look at it sometime.
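
A quick way to get a rough comparison on in-memory data (a sketch using the third-party xxhash package; real numbers depend on the machine, and on disk I/O when hashing actual files):

```python
import hashlib
import time
import xxhash  # third-party: pip install xxhash

def mb_per_s(update, data, rounds=20):
    """Rough throughput of a digest's update() over an in-memory buffer."""
    start = time.perf_counter()
    for _ in range(rounds):
        update(data)
    return len(data) * rounds / (time.perf_counter() - start) / 1e6

data = bytes(64 * 1024 * 1024)  # 64 MiB zero buffer
print("md5  : %d MB/s" % mb_per_s(hashlib.md5().update, data))
print("xxh64: %d MB/s" % mb_per_s(xxhash.xxh64().update, data))
```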